Why does Java permit escaped unicode characters in the source code?

I recently learned that Unicode is permitted within Java source code not only as Unicode characters (eg. double π = Math.PI; ) but also as escaped sequences (eg. double \u03C0 = Math.PI; ).
The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.
Here are a few pieces of code to illustrate usage, tested with Java SE 6 and NetBeans 6.9.1:
This code will print out 3.141592653589793
public static void main(String[] args) {
double π = Math.PI;
System.out.println(\u03C0);
}
Explanation: π and \u03C0 are the same Unicode character
This code will not print out anything
public static void main(String[] args) {
double π = Math.PI; /\u002A
System.out.println(π);
/* a comment */
}
Explanation: The code above actually encodes:
public static void main(String[] args) {
double π = Math.PI; /*
System.out.println(π);
/* a comment */
}
Which comments out the print statement.
Just from my examples, I notice a number of potential problems with this language feature.
First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable. Perhaps there are other horrible things that can be done that I haven't thought of.
Second, there seems to be a lack of support among IDEs. Neither NetBeans nor Eclipse provided the correct code highlighting for the examples. In fact, NetBeans even marked a syntax error (though compilation was not a problem).
Finally, this feature is poorly documented and not commonly accepted. Why would a programmer use something in his code that other programmers will not be able to recognize and understand? In fact, I couldn't even find something about this on the Hidden Java Features question.
My question is this:
Why does Java allow escaped Unicode sequences to be used within syntax?
What are some "pros" of this feature that have allowed it to stay a part of Java, despite its many "cons"?

Unicode escape sequences allow you to store and transmit your source code in pure ASCII and still use the entire range of Unicode characters. This has two advantages:
No risk of non-ASCII characters getting broken by tools that can't handle them. This was a real concern back in the early 1990s when Java was designed. Sending an email containing non-ASCII characters and having it arrive unmangled was the exception rather than the norm.
No need to tell the compiler and editor/IDE which encoding to use for interpreting the source code. This is still a very valid concern. Of course, a much better solution would have been to have the encoding as metadata in a file header (as in XML), but this hadn't yet emerged as a best practice back then.
The first variant makes sense to me - it allows programmers to name variables and methods in an international language of their choice. However, I don't see any practical application of the second approach.
Both will result in exactly the same byte code and have the same power as a language feature. The only difference is in the source code.
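To make the equivalence concrete, here is a minimal sketch written in pure ASCII. The class and method names are invented for illustration; the point is only that the escaped identifier and the escaped string literal behave exactly like their literal counterparts would.

```java
// ASCII-only source: every non-ASCII character is written as a Unicode escape.
class EscapedIdentifierDemo {
    // The identifier below is the single character GREEK SMALL LETTER PI (U+03C0).
    static double \u03C0 = Math.PI;

    // Reads the variable back through the same escaped name.
    static double value() {
        return \u03C0;
    }

    // The escape is translated before tokenizing, so the literal holds one char.
    static String describe() {
        String pi = "\u03C0";
        return pi.length() + ":" + Integer.toHexString(pi.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(value());     // the value of Math.PI
        System.out.println(describe());  // length and hex code point of the literal
    }
}
```

Here describe() returns "1:3c0": one char, code point U+03C0, whichever spelling was used in the source.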
First, a bad programmer could use it to secretly comment out bits of code, or create multiple ways of identifying the same variable.
If you're concerned about a programmer deliberately sabotaging your code's readability, this language feature is the least of your problems.
Second, there seems to be a lack of support among IDEs.
That's hardly the fault of the feature or its designers. But then, I don't think it was ever intended to be used "manually". Ideally, the IDE would have an option to have you enter the characters normally and have them displayed normally, but automatically save them as Unicode escape sequences. There may even already be plugins or configuration options that make the IDEs behave that way.
But in general, this feature seems to be very rarely used and probably therefore badly supported. But how could the people who designed Java around 1993 have known that?

The nice thing about the \u03C0 encoding is that it is much less likely to be munged by a text editor with the wrong encoding settings. For example a bug in my software was caused by the accidental transformation from UTF-8 é into a MacRoman é by a misconfigured text editor. By specifying the Unicode codepoint, it's completely unambiguous what you mean.
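The kind of damage described above is easy to reproduce. The following is a sketch only: it uses ISO-8859-1 as a stand-in for the wrong encoding (an assumption, because MacRoman is not guaranteed to be available in every JDK), and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text as UTF-8, then decodes the bytes with the wrong charset,
    // which is what a misconfigured editor effectively does.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE, written as an escape
        for (char c : misread(eAcute).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The single character U+00E9 comes back as the two characters U+00C3 and U+00A9, whereas the escaped form in the source is immune to the editor's encoding setting.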

The \uXXXX syntax allows Unicode characters to be represented unambiguously in a file with an encoding not capable of expressing them directly, or if you want a representation guaranteed to be usable even in the lowest common denominator, namely a 7-bit ASCII encoding.
You could represent all your characters with \uXXXX, even spaces and letters, but there is rarely a need to.

First, thank you for the question. I think it is very interesting.
Second, the reason is that a Java source file is text that can itself be stored in various charsets. For example, on some systems the default charset in Eclipse is Cp1255. This encoding does not support characters like π. I think the designers had in mind programmers who have to work on systems that do not support Unicode, and wanted to allow them to create Unicode-enabled software. That was the reason to support the \u notation.

The language spec says why this is permitted. There might be other unstated reasons, and unintended benefits and consequences; but this provides a direct answer to the question (emphasis mine):
A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps, which are applied in turn:
A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the UTF-16 code unit whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.
...
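Because the escapes are translated before the source is split into tokens, they can appear anywhere, including inside keywords and comments. A small sketch of that ordering (class and method names are invented for illustration):

```java
class EscapeTranslationDemo {
    static int count() {
        \u0069\u006E\u0074 n = 0;  // the three escapes spell the keyword "int"
        // The escape that follows is a line feed, so this comment ends here: \u000a n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count());
    }
}
```

count() returns 1, not 0: the escaped line feed terminates the comment, and the increment after it is compiled as ordinary code. This is the same mechanism as the /\u002A example in the question.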

Related

How to create my own unique Charset in Java?

I would like to make my own Charset in Java and then use it for the encoding purpose.
I need to add some particular symbols to my Charset as well as all of the numbers and 4 languages (Traditional Chinese, US English, Polish and Russian).
I tried to browse Charset class but didn't really find a solution.
Basil's answer explains that you don't need to define a custom Charset in order to support some non-standard symbols.
But if you really do need to do it, you will have to write a custom class that extends Charset. There are 3 abstract methods that you have to implement:
boolean contains(Charset cs) - Tells whether or not this charset contains the given charset.
CharsetDecoder newDecoder() Constructs a new decoder for this charset.
CharsetEncoder newEncoder() Constructs a new encoder for this charset.
The other methods in the Charset API most likely don't need to be overridden.
The decoder and encoder need to be able to convert between a ByteBuffer containing text in your charset's encoding and Unicode codepoints in a CharBuffer. While both CharsetDecoder and CharsetEncoder are also abstract classes, they require you to implement a decodeLoop or encodeLoop method (respectively) which has complicated requirements.
I am not aware of any specific documentation or tutorials on how to implement a custom Charset and its CharsetDecoder and CharsetEncoder class. But you should be able to find example code in the OpenJDK Java SE codebase. (They will be internal classes ...)
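For orientation, here is a rough sketch of what such a subclass can look like. Everything specific in it is an assumption made for illustration: the name "X-DEMO", the single-byte layout (bytes 0x00-0x7F are ASCII, byte 0x80 maps to the private-use character U+E000), and the deliberately simplistic error handling. It is not modelled on any particular OpenJDK class.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical single-byte charset: ASCII plus one extra symbol at byte 0x80.
class DemoCharset extends Charset {
    static final char SPECIAL = '\uE000'; // private-use character chosen for illustration

    DemoCharset() {
        super("X-DEMO", null); // invented canonical name, no aliases
    }

    @Override
    public boolean contains(Charset cs) {
        // Every US-ASCII character can also be encoded by this charset.
        return cs instanceof DemoCharset || cs.name().equals("US-ASCII");
    }

    @Override
    public CharsetDecoder newDecoder() {
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    int b = in.get(in.position()) & 0xFF; // peek without consuming
                    if (b > 0x80) {
                        return CoderResult.malformedForLength(1);
                    }
                    in.get(); // consume the byte only once it is known to be valid
                    out.put(b == 0x80 ? SPECIAL : (char) b);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    @Override
    public CharsetEncoder newEncoder() {
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            @Override
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) {
                        return CoderResult.OVERFLOW;
                    }
                    char c = in.get(in.position()); // peek without consuming
                    if (c != SPECIAL && c > 0x7F) {
                        return CoderResult.unmappableForLength(1);
                    }
                    in.get();
                    out.put(c == SPECIAL ? (byte) 0x80 : (byte) c);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }
}
```

An instance can be passed directly to String.getBytes(Charset) and new String(byte[], Charset). Making it discoverable through Charset.forName would additionally require registering a java.nio.charset.spi.CharsetProvider, which this sketch leaves out.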
I tried to browse Charset class but didn't really find a solution.
Well the "solution" is that you will need to study existing examples ... or conclude that you don't need to solve this problem at all. See above.
Private Use Areas within Unicode
You’ve not really explained what goal you are trying to achieve, but likely there is no need to invent either:
a character set (a collection of numbers each assigned to a particular character)
a character encoding (a way to represent instances of those numbers as bits and bytes).
Unicode defines over 144,000 characters, each assigned a number from a range of zero to just over a million. That leaves large gaps of numbers unassigned. Some of those empty sub-ranges are reserved for future use. But, of interest to you, some of those sub-ranges are set aside for “private use”, never ever to be assigned to a character by the Unicode Consortium. See Wikipedia.
👉 You are free to assign any meaning you wish to any number within those “private use areas”. So that works as your character set.
👉 As for your character encoding, using UTF-8 is almost always best. This is true for several reasons, as discussed here.
Java supports all of Unicode. So no extra programming needed to support your characters. Everything works the same whether encountering characters from inside or outside the private use areas.
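A short sketch of that claim, assuming you picked U+E000 for one of your symbols (the choice of code point and all names here are hypothetical):

```java
import java.nio.charset.StandardCharsets;

class PrivateUseDemo {
    // Hypothetical assignment: our custom symbol lives at the first private-use code point.
    static final int MY_SYMBOL = 0xE000;

    // True if Unicode classifies the code point as private use.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // Encodes to UTF-8 and decodes again; nothing special is needed for private-use characters.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "x" + new String(Character.toChars(MY_SYMBOL));
        System.out.println(isPrivateUse(MY_SYMBOL));
        System.out.println(roundTrip(text).equals(text));
    }
}
```

What the symbol looks like on screen is a separate matter: that depends on a font that draws a glyph for the code point you chose.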
If you want to involve other people in your endeavor, or want to share documents, then you should be aware that there is an unofficial registry of characters assigned to Private Use numbers. This unofficial registry is a volunteer effort, made outside of the Unicode Consortium. This registry is for characters that would never be accepted for inclusion in Unicode. This includes imaginary languages such as Klingon from Star Trek. When selecting code point numbers for your characters, you may want to avoid these unofficially registered code points.

How should I specify Asian char, and String, constants in Java?

I need to tokenize Japanese sentences. What are the best practices for representing the char values of kana and kanji? This is what I might normally do:
String s = "a";
String token = sentence.split(s)[0];
But, the following is not good in my opinion:
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
because people who read my source might not be able to read, or display, Japanese characters. I'd prefer to not insult anyone by writing the actual character. I'd want a "romaji", or something, representation. This is an example of the really stupid "solution" I am using:
char YaSmall_hira_char = (char) 12419; // [ゃ] <--- small
char Ya_hira_char = (char) 12420; // [や]
char Toshi_kj_char = (char) 24180; // [年]
char Kiku_kj_char = (char) 32862; // [聞]
That looks absolutely ridiculous. And, it's not sustainable because there are over 2,000 Japanese characters...
My IDE, and java.io.InputStreamReaders, are all set to UTF-8, and my code is working fine. But the specter of character encoding bugs is hanging over my head because I just don't understand how to represent Asian characters as chars.
I need to clean-up this garbage I wrote, but I don't know which direction to go. Please help.
because people who read my source might not be able to read, or display, Japanese characters.
Then how could they do anything useful with your code when dealing with such characters is an integral part of it?
Just make sure your development environment is set up correctly to support these characters in source code and that you have procedures in place to ensure everyone who works with the code will get the same correct setup. At the very least document it in your project description.
Then there is nothing wrong with using those characters directly in your source.
I agree that what you are currently doing is unsustainable. It is horribly verbose, and probably a waste of your time anyway.
You need to ask yourself who exactly you expect to read your code:
A native Japanese speaker / writer can read the Kana. They don't need the romaji, and would probably consider them to be an impediment to readability.
A non Japanese speaker would not be able to discern the meaning of the characters whether they are written as Kana or as romaji. Your effort would be wasted for them.
The only people who might be helped by romaji would be non-native Japanese speakers who haven't learned to read / write Kana (yet). And I imagine they could easily find a desktop tool / app for mapping Kana to romaji.
So let's step back to your example which you think is "not good".
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
Even to someone (like me) who can't read (or speak) Japanese, the surface meaning of that code is clear. You are splitting the String using a Japanese character as the separator.
Now, I don't understand the significance of that character. But I wouldn't if it was a constant with a romaji name either. Besides, the chances are that I don't need to know in order to understand what the application is doing. (If I do need to know, I'm probably the wrong person to be reading the code. Decent Japanese language skills are mandatory for your application domain!!)
The issue you raised about not being able to display the Japanese characters is easy to solve. The programmer simply needs to upgrade to software that can display Kana. Any decent Java IDE will be able to cope ... if properly configured. Besides, if this is a real concern, the proper solution (for the programmer!) is to use Java's Unicode escape sequence mechanism to represent the characters; e.g.
String s = String.valueOf('\u3042'); // hiragana "a", code point U+3042
JDKs up to version 8 include a tool (native2ascii) that can rewrite Java source code to add or remove Unicode escaping. All the programmer needs to do is to "escape" the code before trying to read it.
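As a sketch of the escape-based approach, the delimiter from the question can be declared once as a named constant. The constant and method names below are invented for illustration; the code point U+3042 is the one for the kana used in the question.

```java
import java.util.Arrays;

class KanaEscapeDemo {
    // Hiragana "a" (the delimiter from the question), written as an escape.
    static final char HIRAGANA_A = '\u3042';

    // Splits a sentence on the kana delimiter.
    static String[] splitOnA(String sentence) {
        return sentence.split(String.valueOf(HIRAGANA_A));
    }

    // Formats a char the way Unicode charts label it, e.g. for use in a comment.
    static String codePoint(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        // Three kana: U+304B, then the delimiter U+3042, then U+3055.
        System.out.println(Arrays.toString(splitOnA("\u304B\u3042\u3055")));
        System.out.println(codePoint((char) 12419)); // one of the decimal casts from the question
    }
}
```

codePoint((char) 12419) gives "U+3083", which shows that the decimal casts in the question and the hexadecimal escapes are just two spellings of the same char values.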
Aside: You wrote this:
"I'd prefer to not insult anyone by writing the actual character."
What? No Westerner would or should consider Kana an insult! They may not be able to read it, but that's not an insult / insulting. (And if they do feel genuinely insulted, then frankly that's their problem ... not yours.)
The only thing that matters here is whether non-Japanese-reading people can fully understand your code ... and whether that's a problem you ought to be trying to solve. Worrying about solving unsolvable problems is not a fruitful activity.
Michael has the right answer, I think. (Posting this as an Answer rather than a Comment because Comment sizes are limited; apologies to those who are picky about the distinction.)
If anyone is working with your code, it will be because they need to alter how Japanese sentences are tokenized. They had BETTER be able to deal with Japanese characters at least to some degree, or they'll be unable to test any changes they make.
As you've pointed out, the alternatives are certainly no more readable. Maybe less so; even without knowing Japanese I can read your code and know that you are using the 'あ' character as your delimiter, so if I see that character in an input string I know what the output will be. I have no idea what the character means, but for this simple bit of code analysis I don't need to.
If you want to make it a bit easier for those of us who don't know the full alphabet, then when referring to single characters you could give us the Unicode value in a comment. But any Unicode-capable text editor ought to have a function that tells us the numeric value of the character we've pointed at -- Emacs happily tells me that it's #x3042 -- so that would purely be a courtesy to those of us who probably shouldn't be messing with your code anyway.

Regex: what is InCombiningDiacriticalMarks?

The following code is a well-known way to convert accented chars into plain text:
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
I replaced my "hand made" method by this one, but I need to understand the "regex" part of the replaceAll
1) What is "InCombiningDiacriticalMarks" ?
2) Where is the documentation of it? (and of similar ones?)
Thanks.
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
What it means is that the code point falls within a particular range, a block, that has been allocated to use for the things by that name. This is a bad approach, because there is no guarantee that the code point in that range is or is not any particular thing, nor that code points outside that block are not of essentially the same character.
For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.
Blocks are nearly never what you want.
In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacriticals block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short-cut \p{Latin}, to get at any character from the Latin script. This leads to the very commonly needed [\p{Latin}\p{Common}\p{Inherited}].
Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
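A minimal sketch of the \p{Mn} variant, which also shows the limitation just mentioned (class and method names are made up):

```java
import java.text.Normalizer;

class AccentStripDemo {
    // Decomposes the text (NFD) and then drops every nonspacing mark.
    static String stripMarks(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // "eleve" spelled with e-acute (U+00E9) and e-grave (U+00E8)
        System.out.println(stripMarks("\u00E9l\u00E8ve"));
        // D WITH STROKE (U+0110) has no canonical decomposition, so it is left untouched
        System.out.println(stripMarks("\u0110").equals("\u0110"));
    }
}
```

The first call yields "eleve"; the second confirms that a letter without a canonical decomposition passes through unchanged.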
Another place where the \p{Mn} thing fails is of course enclosing marks like \p{Me}, obviously, but also there are \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.
Oh wait, I see you are Portuguese. You should have no problems at all then if you only are dealing with Portuguese text.
However, you don’t really want to remove accents, I bet, but rather you want to be able to match things “accent-insensitively”, right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
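For readers without ICU4J at hand, here is a rough sketch of the same idea using the JDK's own java.text.Collator as a stand-in (that substitution is my assumption; it is not the ICU4J class mentioned above). Names are invented for illustration.

```java
import java.text.Collator;
import java.util.Locale;

class AccentInsensitiveDemo {
    // Compares two strings at primary strength, where accent differences do not count.
    static boolean sameIgnoringAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("es"));
        collator.setStrength(Collator.PRIMARY);
        // Ask the collator to decompose accented characters before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "cancion" with and without o-acute (U+00F3)
        System.out.println(sameIgnoringAccents("canci\u00F3n", "cancion"));
        System.out.println(sameIgnoringAccents("cancion", "camion"));
    }
}
```

Note that primary strength also ignores case differences, so this is a looser comparison than "accent-insensitive" alone suggests.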
Took me a while, but I fished them all out:
Here's regex that should include all the zalgo chars including ones bypassed in 'normal' range.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62])
Hope this saves you some time.

How to add features missing from the Java regex implementation?

I'm new to Java. As a .Net developer, I'm very much used to the Regex class in .Net. The Java implementation of Regex (Regular Expressions) is not bad but it's missing some key features.
I wanted to create my own helper class for Java but I thought maybe there is already one available. So is there any free and easy-to-use product available for Regex in Java or should I create one myself?
If I would write my own class, where do you think I should share it for the others to use it?
[Edit]
There were complaints that I wasn't addressing the problem with the current Regex class. I'll try to clarify my question.
In .Net the usage of a regular expression is easier than in Java. Since both languages are object oriented and very similar in many aspects, I expect to have a similar experience with using regex in both languages. Unfortunately that's not the case.
Here's a little code compared in Java and C#. The first is C# and the second is Java:
In C#:
string source = "The colour of my bag matches the color of my shirt!";
string pattern = "colou?r";
foreach(Match match in Regex.Matches(source, pattern))
{
Console.WriteLine(match.Value);
}
In Java:
String source = "The colour of my bag matches the color of my shirt!";
String pattern = "colou?r";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);
while(m.find())
{
System.out.println(source.substring(m.start(), m.end()));
}
I tried to be fair to both languages in the sample code above.
The first thing you notice here is the .Value member of the Match class (compared to using .start() and .end() in Java).
Why should I create two objects when I can call a static function like Regex.Matches or Regex.Match, etc.?
In more advanced usages, the difference shows itself much more. Look at the method Groups, dictionary length, Capture, Index, Length, Success, etc. These are all very necessary features that in my opinion should be available for Java too.
Of course all of these features can be manually added by a custom proxy (helper) class. This is the main reason why I asked this question. We don't have the breeze of Regex in Perl but at least we can use the .Net approach to Regex which I think is very cleverly designed.
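A helper of the kind described could be sketched like this. The class name and method are hypothetical; it uses Matcher.group(), which returns the matched text directly, so the start()/end() arithmetic is not needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexHelper {
    // Static convenience method loosely modelled on .Net's Regex.Matches:
    // returns the text of every match of the pattern in the source.
    static List<String> matches(String source, String pattern) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(pattern).matcher(source);
        while (m.find()) {
            found.add(m.group()); // the matched text, like match.Value in C#
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "The colour of my bag matches the color of my shirt!";
        for (String match : matches(source, "colou?r")) {
            System.out.println(match);
        }
    }
}
```

With this in place the Java loop is about as short as the C# one, at the cost of recompiling the pattern on every call.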
From your edited example, I can now see what you would like. And you have my sympathies in this, too. Java’s regexes are a long, long, long ways from the convenience you find in Ruby or Perl. And they pretty much always will be; this cannot be fixed, so we’re stuck with this mess forever — at least in Java. Other JVM languages do a better job at this, especially Groovy. But they still suffer some of the inherent flaws, and can only go so far.
Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be ok in small programs, depending how you use them. However, they do indeed have several problems, which it appears you have discovered. Here’s a partial list of those problems, and what can and cannot be done about them.
The inconvenience method is very bizarrely named “matches” but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can’t just copy your patterns around, even within methods in the same darned class for goodness’ sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.
There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you have to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.
The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufactures a fake return element where there should be none, which means you can’t distinguish between legit results and bogus ones. It is a serious design flaw that, splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don’t people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It’s not ok to break broken things in Java the way it is anywhere else. Broken is forever here.
The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it’s too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it’s always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn’t, you probably haven’t gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don’t go nuts. Here’s a fair collection of Groovy regex examples showing how simple it can and should be.
The (?x) mode is deeply flawed. It doesn’t take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn’t work with multiline strings. It doesn’t accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!
It is incredibly difficult (and indeed, fundamentally unfixably broken) to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you’re stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with javac -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!
Many of the regex things we’ve come to rely on in other languages are missing from Java. There are no named groups, for example, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don’t have relatively numbered buffers, either. We’re back to the Bad Old Days again, stuff that was solved aeons ago.
There is no support for a linebreak sequence, which is one of the only two “Strongly Recommended” parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java’s lack of support for graphemes.
The character class escapes do not work on Java’s native character set! Yes, that’s right: routine stuff like \w and \s (or rather, "\\w" and "\\b") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java’s \b (make that "\\b", which isn’t the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Pattern.find. This is just so screwed up as to beggar belief. They’ve broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn’t even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.
The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standard’s Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don’t even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the minimum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.
Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode property names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw: if you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break on NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace; Java ignores that requirement, and so breaks your parser.
There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme support varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.
The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time, which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java’s unpleasant after-the-fact, bolted-on-the-side model, and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which you might yourself build what you need: you can’t get there from here.
The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can’t create a better version of a matches method by subclassing and replacement. Heck, you can’t even subclass! Final is not a solution; final is a death sentence from which there is no appeal.
Finally, to show you just how brain-damaged Java’s regexes truly are, consider this multiline pattern, which shows many of the flaws already described:
String rx =
"(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + $)\n"
+ " # next is a big can't-have set \n"
+ "(?! ^ .* \n"
+ " (?: ^ \\d+ $ \n"
+ " | ^ \\p{Lu} - \\p{Lu} $ \n"
+ " | Invitrogen \n"
+ " | Clontech \n"
+ " | L-L-X-X # dashes ok \n"
+ " | Sarstedt \n"
+ " | Roche \n"
+ " | Beckman \n"
+ " | Bayer \n"
+ " ) # end alternatives \n"
+ " \\b # only on a word boundary \n"
+ ") # end negated lookahead \n"
;
Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don’t work right on Unicode. There are many more problems beyond that.
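For anyone who has to write a pattern in this style anyway: whitespace and # comments are only ignored if the pattern is compiled with Pattern.COMMENTS (or starts with the embedded (?x) switch). Here is a cut-down sketch of the first part of the pattern above, not the full thing; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class CommentedPattern {
    // Every line ends in "\n" so that a '#' comment stops at the end of its own line.
    private static final String RX =
          "^ \\p{Lu}            # one uppercase letter first\n"
        + "  [_\\pL\\pM\\d\\-]+ # then letters, marks, digits, underscores or dashes\n"
        + "$                    # and nothing else\n";

    // Without the COMMENTS flag the spaces and comments would be matched literally.
    private static final Pattern NAME = Pattern.compile(RX, Pattern.COMMENTS);

    static boolean isName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isName("Roche-1"));
        System.out.println(isName("roche"));
    }
}
```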
Not only are there no plans to fix almost any of these grievous flaws, it is indeed impossible to fix almost any of them at all, because doing so would change the behaviour of old programs. Even the normal tools of OO design are forbidden to you because it’s all locked down with the finality of a death sentence, and it cannot be fixed.
So Alireza Noori, if you feel Java’s clumsy regexes are too hosed for reliable and convenient regex processing ever to be possible in Java, I cannot gainsay you. Sorry, but that’s just the way it is.
“Fixed in the Next Release!”
Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of which are already fixed in current JDK7 or proposed JDK8 builds:
The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.
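A small sketch of the script property in use, assuming Java 7 or later. It exercises only the script=, sc= and Is spellings, which are the ones I am sure the Pattern documentation lists; the class name is made up.

```java
import java.util.regex.Pattern;

class ScriptProperty {
    // Three spellings of the same script test.
    private static final Pattern LONG_FORM = Pattern.compile("\\p{script=Greek}+");
    private static final Pattern SHORT_FORM = Pattern.compile("\\p{sc=Greek}+");
    private static final Pattern IS_FORM = Pattern.compile("\\p{IsGreek}+");

    // True only if all three spellings agree that every character is Greek.
    static boolean isGreek(String s) {
        return LONG_FORM.matcher(s).matches()
            && SHORT_FORM.matcher(s).matches()
            && IS_FORM.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isGreek("\u03C0\u03B1")); // Greek pi, alpha
        System.out.println(isGreek("pi"));           // Latin letters
    }
}
```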
The UTF-16 bug has a workaround. You may now specify any Unicode code point by its number using the \x{⋯} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You still must double backslash it though, and it only works in regexes, not strings in general as it really ought to.
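To illustrate, here is a minimal sketch (Java 7 or later assumed, names hypothetical) that tests a supplementary code point against that range. The input string has to be built with Character.toChars, because the \x{⋯} notation is not available in ordinary string literals.

```java
import java.util.regex.Pattern;

class SupplementaryRange {
    // MATHEMATICAL BOLD CAPITAL A through Z, all outside the Basic Multilingual Plane.
    private static final Pattern BOLD_CAPITAL =
            Pattern.compile("[\\x{1D400}-\\x{1D419}]");

    static boolean isBoldCapital(int codePoint) {
        // Character.toChars yields the surrogate pair for supplementary code points.
        String s = new String(Character.toChars(codePoint));
        return BOLD_CAPITAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isBoldCapital(0x1D402)); // MATHEMATICAL BOLD CAPITAL C
        System.out.println(isBoldCapital('C'));     // plain ASCII C
    }
}
```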
Named groups are now supported via the standard notation (?<NAME>⋯) to create it and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot get at more than one of them in the same pattern, nor can you use them for recursion.
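A short sketch of both halves of that notation, assuming Java 7 or later. The patterns and names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroups {
    // (?<NAME>...) creates a named group; it is still group 1, 2, ... as well.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // \k<NAME> refers back to whatever the named group matched.
    private static final Pattern DOUBLED =
            Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b");

    static String yearOf(String date) {
        Matcher m = DATE.matcher(date);
        return m.matches() ? m.group("year") : null;
    }

    static boolean hasDoubledWord(String text) {
        return DOUBLED.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2011-07"));
        System.out.println(hasDoubledWord("the the cat"));
    }
}
```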
A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASSES and associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they now conform to the definitions of those things required by The Unicode Standard.
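The difference is easy to see with an accented letter and a NO-BREAK SPACE. A minimal sketch, assuming Java 7 or later, with a hypothetical helper:

```java
import java.util.regex.Pattern;

class UnicodeClasses {
    // Matches input against regex, with or without UNICODE_CHARACTER_CLASSES.
    static boolean matches(String regex, String input, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASSES : 0;
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String nbsp = "\u00A0";   // NO-BREAK SPACE

        System.out.println(matches("\\w", eAcute, false));     // default ASCII-only definition
        System.out.println(matches("\\w", eAcute, true));      // Unicode definition
        System.out.println(matches("\\s", nbsp, false));
        System.out.println(matches("\\s", nbsp, true));
        System.out.println(matches("(?U)\\w", eAcute, false)); // embedded switch
    }
}
```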
The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} will now be supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expressions”, without which you really cannot work with Unicode.
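Two code points show why the distinction matters: U+2170 SMALL ROMAN NUMERAL ONE is lowercase without being a lowercase letter, and U+0345 COMBINING GREEK YPOGEGRAMMENI is alphabetic without being a letter at all. A sketch assuming Java 7 or later, with a hypothetical helper:

```java
import java.util.regex.Pattern;

class BinaryProperties {
    // Tests a single code point against a one-character regex.
    static boolean matches(String regex, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.compile(regex).matcher(s).matches();
    }

    public static void main(String[] args) {
        int romanOne = 0x2170;    // general category Nl, but has the Lowercase property
        int ypogegrammeni = 0x0345; // general category Mn, but has the Alphabetic property

        System.out.println(matches("\\p{Ll}", romanOne));
        System.out.println(matches("\\p{IsLowercase}", romanOne));
        System.out.println(Character.isLowerCase(romanOne));

        System.out.println(matches("\\p{L}", ypogegrammeni));
        System.out.println(matches("\\p{IsAlphabetic}", ypogegrammeni));
        System.out.println(Character.isAlphabetic(ypogegrammeni));
    }
}
```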
These enhancements and fixes are very important to finally have, and so I am glad, even excited, to have them.
But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There’s just too much missing from Java’s still-patchy-after-20-years Unicode model to get real work done if you dare to use the character set that Java gives. And the bolted-on-the-side model never works, which is all Java regexes are. You have to start over from first principles, the way Groovy did.
Sure, it might work for very limited applications whose small customer base is limited to English-language monoglots in rural Iowa with no external interactions or any need for characters beyond what an old-style telegraph could send. But for how many projects is that really true? Fewer even than you think, it turns out.
It is for this reason that a certain (and obvious) multi-billion-dollar company just recently cancelled international deployment of an important application. Java’s Unicode support, not just in regexes but throughout, proved to be too weak for the needed internationalization to be done reliably in Java. Because of this, they have been forced to scale back from their originally planned worldwide deployment to a merely U.S. deployment. It’s positively parochial. And no, they are Nᴏᴛ Hᴀᴘᴘʏ; would you be?
Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I wouldn’t hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you too may get stuck without any real options once you’re too far into it to rescue your project.
Caveat Emptor
One can rant, or one can simply write:
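import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {
    /**
     * @param source
     *            the string to scan
     * @param pattern
     *            the regular expression to scan for
     * @return the matched substrings, in order of occurrence
     */
    public static Iterable<String> matches(final String source, final String pattern) {
        final Pattern p = Pattern.compile(pattern);
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                // A fresh Matcher per iterator, so the Iterable can be traversed more than once.
                final Matcher m = p.matcher(source);
                return new Iterator<String>() {
                    private boolean found; // a match is waiting to be returned
                    private boolean done;  // the matcher is exhausted

                    @Override
                    public boolean hasNext() {
                        // Only advance the matcher when no match is pending,
                        // so repeated hasNext() calls do not skip matches.
                        if (!found && !done) {
                            found = m.find();
                            done = !found;
                        }
                        return found;
                    }

                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        found = false;
                        return m.group();
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
}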
Used as you wish:
import org.junit.Test;

public class RegexTest {
    @Test
    public void test() {
        String source = "The colour of my bag matches the color of my shirt!";
        String pattern = "colou?r";
        for (String match : Regex.matches(source, pattern)) {
            System.out.println(match);
        }
    }
}
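On newer JDKs the same helper gets shorter. As far as I know, Matcher.results() was added in Java 9 and returns a stream of MatchResult objects; assuming that, a rough equivalent looks like this (the class name is made up so it does not clash with the one above):

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexResults {
    // Collects every match of pattern in source, in order of occurrence.
    static List<String> matches(String source, String pattern) {
        return Pattern.compile(pattern)
                .matcher(source)
                .results()                 // Java 9+: one MatchResult per match
                .map(MatchResult::group)   // the matched text itself
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String source = "The colour of my bag matches the color of my shirt!";
        for (String match : matches(source, "colou?r")) {
            System.out.println(match);
        }
    }
}
```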
Some of the API flaws mentioned in @tchrist's answer were fixed in Kotlin.
Boy, do I hear you on that one Alireza! Regexes are confusing enough without there being so many syntax variations among them. I too do a lot more C# than Java programming and had the same issue.
I found this to be very helpful:
http://www.tusker.org/regex/regex_benchmark.html
- it's a list of alternative regular expression implementations for Java, benchmarked.
This one is darned good, if I do say so myself!
regex-tester-tool

Is it a good idea to use unicode symbols as Java identifiers?

I have a snippet of code that looks like this:
double Δt = lastPollTime - pollTime;
double α = 1 - Math.exp(-Δt / τ);
average += α * (x - average);
Just how bad an idea is it to use unicode characters in Java identifiers? Or is this perfectly acceptable?
It's a bad idea, for various reasons.
Many people's keyboards do not support these characters. If I were to maintain that code on a qwerty keyboard (or any other without Greek letters), I'd have to copy and paste those characters all the time.
Some people's editors or terminals might not display these characters properly. For example, some editors (unfortunately) still default to some ISO-8859 (Latin) variant. The main reason why ASCII is still so prevalent is that it nearly always works.
Even if the characters can be rendered properly, they may cause confusion. Straight from Sun (emphasis mine):
Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all different.
...
Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers.
This is in no way an imaginary problem: α (U+03b1 GREEK SMALL LETTER ALPHA) and ⍺ (U+237a APL FUNCTIONAL SYMBOL ALPHA) are different characters!
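Both points from the quote can be checked with plain strings: the composed and decomposed forms compare unequal until they are normalized, and the two alphas stay different regardless. A small sketch with a hypothetical helper:

```java
import java.text.Normalizer;

class LookAlikes {
    // Compares two strings after bringing both into NFC (canonical composition).
    static boolean sameAfterNfc(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String composed = "\u00C1";    // LATIN CAPITAL LETTER A WITH ACUTE
        String decomposed = "A\u0301"; // A followed by COMBINING ACUTE ACCENT

        System.out.println(composed.equals(decomposed));       // raw comparison
        System.out.println(sameAfterNfc(composed, decomposed)); // after normalization
        // GREEK SMALL LETTER ALPHA versus APL FUNCTIONAL SYMBOL ALPHA
        System.out.println(sameAfterNfc("\u03B1", "\u237A"));
    }
}
```

The compiler does no such normalization on identifiers, which is why the quoted text calls them different.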
There is no way to tell which characters are valid. The characters from your code work, but when I use the FUNCTIONAL SYMBOL ALPHA my Java compiler complains about "illegal character: \9082", even though the functional symbol would be more appropriate in this code. There seems to be no solid rule about which characters are acceptable, except asking Character.isJavaIdentifierPart().
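Asking the Character class looks roughly like this. The helper name is made up, and the check is only a sketch: it looks at characters only, so it does not reject keywords such as "class".

```java
class IdentifierCheck {
    // True if every code point is allowed at its position in a Java identifier.
    static boolean isValidIdentifier(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Step by code point so supplementary characters are handled as one unit.
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidIdentifier("\u0394t")); // GREEK CAPITAL LETTER DELTA + t
        System.out.println(isValidIdentifier("\u03B1"));  // GREEK SMALL LETTER ALPHA
        System.out.println(isValidIdentifier("\u237A"));  // APL FUNCTIONAL SYMBOL ALPHA
    }
}
```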
Even though you may get it to compile, it seems doubtful that all Java virtual machine implementations have been rigorously tested with Unicode identifiers. If these characters are only used for variables in method scope, they should get compiled away, but if they are class members, they will end up in the .class file as well, possibly breaking your program on buggy JVM implementations.
It looks good as it uses the correct symbols, but how many of your team will know the keystrokes for those symbols?
I would use an English representation just to make it easier to type. And others might not have a character set that supports those symbols set up on their PC.
That code is fine to read, but horrible to maintain. I suggest using plain English identifiers like so:
double deltaTime = lastPollTime - pollTime;
double alpha = 1 - Math.exp(-delta....
It is perfectly acceptable if it is acceptable in your working group. A lot of the answers here operate on the arrogant assumption that everybody programs in English. Non-English programmers are by no means rare these days and they're getting less rare at an accelerating rate. Why should they restrict themselves to English versions when they have a perfectly good language at their disposal?
Anglophone arrogance aside, there are other legitimate reasons for using non-English identifiers. If you're writing mathematics packages, for example, using Greek is fine if your target is fellow mathematicians. Why should people type out "delta" in your workgroup when everybody can understand "Δ" and likely type it more quickly? Almost any problem domain will have its own jargon and sometimes that jargon is expressed in something other than the Latin alphabet. Why on Earth would you want to try and jam everything into ASCII?
It's an excellent idea. Honest. It's just not easily practicable at this time. Let's keep a reference to it for the future. I would love to see triangles, circles, squares, etc. as part of program code. But for now, please do try to rewrite it the way Crozin suggests.
Why not?
If the people working on that code can type those easily, it's acceptable.
But god help those who can't display unicode, or who can't type them.
In a perfect world, this would be the recommended way.
Unfortunately you run into character encodings when moving outside of plain 7-bit ASCII characters (UTF-8 is different from ISO-Latin-1 is different from UTF-16 etc.), meaning that you eventually will run into problems. This has happened to me when moving from Windows to Linux. Our national Scandinavian characters broke in the process, but fortunately they were only in strings. We then used the \u encoding for all those.
If you can be absolutely certain that you will never, ever run into such a thing - for instance if your files contain a proper BOM - then by all means, do this. It will make your code more readable. If there is even the smallest amount of doubt, then don't.
(Please note that "use non-English languages" is a different matter. I'm just thinking of using symbols instead of letters.)
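For the string case, the conversion to \u escapes is mechanical. This is only a sketch of the idea with a made-up helper name, not the tool we actually used:

```java
class UnicodeEscaper {
    // Rewrites every char outside 7-bit ASCII as a backslash-u escape with four hex digits.
    // Supplementary characters come out as two escapes, one per surrogate.
    static String toAsciiEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "blabaer" with a-ring and ae-ligature, written with escapes so the
        // source file itself stays pure ASCII.
        System.out.println(toAsciiEscapes("bl\u00E5b\u00E6r"));
    }
}
```

The escaped form survives any ASCII-compatible encoding, which is why it rescued our strings.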
