Backreferences Syntax in Replacement Strings (Why Dollar Sign?) - java

In Java, and it seems in a few other languages, backreferences in the pattern are preceded by a backslash (e.g. \1, \2, \3, etc), but in a replacement string they preceded by a dollar sign (e.g. $1, $2, $3, and also $0).
Here's a snippet to illustrate:
System.out.println(
"left-right".replaceAll("(.*)-(.*)", "\\2-\\1") // WRONG!!!
); // prints "2-1"
System.out.println(
"left-right".replaceAll("(.*)-(.*)", "$2-$1") // CORRECT!
); // prints "right-left"
System.out.println(
"You want million dollar?!?".replaceAll("(\\w*) dollar", "US\\$ $1")
); // prints "You want US$ million?!?"
System.out.println(
"You want million dollar?!?".replaceAll("(\\w*) dollar", "US$ \\1")
); // throws IllegalArgumentException: Illegal group reference
Questions:
Is the use of $ for backreferences in replacement strings unique to Java? If not, what language started it? What flavors use it and what don't?
Why is this a good idea? Why not stick to the same pattern syntax? Wouldn't that lead to a more cohesive and an easier to learn language?
Wouldn't the syntax be more streamlined if statements 1 and 4 in the above were the "correct" ones instead of 2 and 3?

Is the use of $ for backreferences in replacement strings unique to Java?
No. Perl uses it, and Perl certainly predates Java's Pattern class. Java's regex support is explicitly described in terms of Perl regexes.
For example: http://perldoc.perl.org/perlrequick.html#Search-and-replace
Why is this a good idea?
Well obviously you don't think it is a good idea! But one reason that it is a good idea is to make Java search/replace support (more) compatible with Perl's.
There is another possible reason why $ might have been viewed as a better choice than \. That is that \ has to be written as \\ in a Java String literal.
But all of this is pure speculation. None of us were in the room when the design decisions were made. And ultimately it doesn't really matter why they designed the replacement String syntax that way. The decisions have been made and set in concrete, and any further discussion is purely academic ... unless you just happen to be designing a new language or a new regex library for Java.

After doing some research, I've understood the issues now: Perl had to use a different symbol for pattern backreferences and replacement backreferences, and while java.util.regex.* doesn't have to follow suit, it chooses to, not for a technical but rather traditional reason.
On the Perl side
(Please keep in mind that all I know about Perl at this point comes from reading Wikipedia articles, so feel free to correct any mistakes I may have made)
The reason why it had to be done this way in Perl is the following:
Perl uses $ as a sigil (i.e. a symbol attached to variable name).
Perl string literals are variable interpolated.
Perl regex actually captures groups as variables $1, $2, etc.
Thus, because of the way Perl is interpreted and how its regex engine works, a preceding slash for backreferences (e.g. \1) in the pattern must be used, because if the sigil $ is used instead (e.g. $1), it would cause unintended variable interpolation into the pattern.
The replacement string, due to how it works in Perl, is evaluated within the context of every match. It is most natural for Perl to use variable interpolation here, so the regex engine captures groups into variables $1, $2, etc, to make this work seamlessly with the rest of the language.
References
Wikipedia/String literal - variable interpolation
Wikipedia/Sigil (computer programming)
On the Java side
Java is a very different language than Perl, but most importantly here is that there is no variable interpolation. Moreover, replaceAll is a method call, and as with all method calls in Java, arguments are evaluated once, prior to the method invoked.
Thus, variable interpolation feature by itself is not enough, since in essence the replacement string must be re-evaluated on every match, and that's just not the semantics of method calls in Java. A variable-interpolated replacement string that is evaluated before the replaceAll is even invoked is practically useless; the interpolation needs to happen during the method, on every match.
Since that is not the semantics of Java language, replaceAll must do this "just-in-time" interpolation manually. As such, there is absolutely no technical reason why $ is the escape symbol for backreferences in replacement strings. It could've very well been the \. Conversely, backreferences in the pattern could also have been escaped with $ instead of \, and it would've still worked just as fine technically.
The reason Java does regex the way it does is purely traditional: it's simply following the precedent set by Perl.

Related

Java Matcher matches() method to match the entire region against the pattern

I have a pattern (\{!(.*?)\})+ that can be used to validate an expression of format {!someExpression} one or more number of times.
I am performing
Pattern.compile("(\\{!(.*?)\\})+").matcher("{!expression1} {!expression2}").matches() to match the entire region against the pattern.
There is a space between expression1 and expression2.
Expected -> false
Actual -> true
I tried both greedy and lazy quantifiers but not able to figure out the catch here. Any help is appreciated.
Of course it matches. Your regexp says so. matches() matches the whole string, so you're doing exactly what you are asking. The point is, that regex matches the whole string. Try it in any regex tool.
Specifically, (.*?) will happily match expression1} {!expression2. Why shouldn't it? You said 'non-greedy' which doesn't do anything unless we're talking about subgroup matching; non-greediness cannot change what is being matched, it only affects, if it matches, how the groups are divided out. Non-greedy does not mean 'magically do what I want you to', however useful that might seem to be. . will match } just as well as x.
As a general rule if you're using non-greediness you're doing it wrong. It's not a universal rule; if you really know what you're doing (mostly: That you're modifying how backrefs / group matches / find() ends up spacing it out), it's fine. If you're tossing non-greediness in there as you write your regexp that's usually a sign you misunderstand what you're actually writing down.
Presumably, your intent with the non-greedy operator here is that you do not want it to also consume the } that 'ends' the {!expr} block.
In which case, just ask for that then: "Consume everything that isn't a }":
Pattern.compile("(\\{!([^}]*)\\})+").matcher("{!expression1} {!expression2}").matches()
works great.
If your intent is instead that expressions can also contain {} symbols and that this is a much more convoluted grammar system then your question cannot be answered without a full breakdown of what the grammarsystem entails. Note that many grammars are not 'regular' (that's a specific term that refers to a subset of all imaginable grammars), then it cannot be parsed out with a regular expression. That's what the 'regular' in regular expression refers to: A class of grammars. regexes can be used meaningfully on anything that fits a regular grammar. They are useless for anything that isn't, even if it seems like it could work. Thus, if there is a sizable grammar behind this {expr} syntax, it's possible you need an actual full parser for it.
As a simple example, java the language is not regular and therefore cannot meaningfully be parsed with regexes (that is: Whatever aim your regex has, I can write a valid java file that the compiler understands which your regex won't).

Checking for specific strings with regex

I have a list of arbitrary length of Type String, I need to ensure each String element in the list is alphanumerical or numerical with no spaces and special characters such as - \ / _ etc.
Example of accepted strings include:
J0hn-132ss/sda
Hdka349040r38yd
Hd(ersd)3r4y743-2\d3
123456789
Examples of unacceptable strings include:
Hello
Joe
King
etc basically no words.
I’m currently using stringInstance.matches("regex") but not too sure on how to write the appropriate expression
if (str.matches("^[a-zA-Z0-9_/-\\|]*$")) return true;
else return false;
This method will always return true for words that don't conform to the format I mentioned.
A description of the regex I’m looking for in English would be something like:
Any String, where the String contains characters from (a-zA-Z AND 0-9 AND special characters)
OR (0-9 AND Special characters)
OR (0-9)
Edit: I have come up with the following expression which works but I feel that it may be bad in terms of it being unclear or to complex.
The expression:
(([\\pL\\pN\\pP]+[\\pN]+|[\\pN]+[\\pL\\pN\\pP]+)|([\\pN]+[\\pP]*)|([\\pN]+))+
I've used this website to help me: http://xenon.stanford.edu/~xusch/regexp/analyzer.html
Note that I’m still new to regex
WARNING: “Never” Write A-Z
All instances of ranges like A-Z or 0-9 that occur outside an RFC definition are virtually always ipso facto wrong in Unicode. In particular, things like [A-Za-z] are horrible antipatterns: they’re sure giveaways that the programmer has a caveman mentality about text that is almost wholly inappropriate this side of the Millennium. The Unicode patterns work on ASCII, but the ASCII patterns break on Uniocode, sometimes in ways that leave you open to security violations. Always write the Unicode version of the pattern no matter whether you are using 1970s data or modern Unicode, because that way you won’t screw up when you actually use real Java character data. It’s like the way you use your turn signal even when you “know” there is no one behind you, because if you’re wrong, you do no harm, whereas the other way, you very most certainly do. Get used to using the 7 Unicode categories:
\pL for Letters. Notice how \pL is a lot shorter to type than [A-Za-z].
\pN for Numbers.
\pM for Marks that combine with other code points.
\pS for Symbols, Signs, and Sigils. :)
\pP for Punctuation.
\pZ for Separators like spaces (but not control characters)
\pC for other invisible formatting and Control characters, including unassigned code points.
Solution
If you just want a pattern, you want
^[\pL\pN]+$
although in Java 7 you can do this:
(?U)^\w+$
assuming you don’t mind underscores and letters with arbitrary combining marks. Otherwise you have to write the very awkward:
(?U)^[[:alpha:]\pN]+$
The (?U) is new to Java 7. It corresponds to the Pattern class’s UNICODE_CHARACTER_CLASSES compilation flag. It switches the POSIX character classes like [:alpha:] and the simple shortcuts like \w to actually work with the full Java character set. Normally, they work only on the 1970sish ASCII set, which can be a security hole.
There is no way to make Java 7 always do this with its patterns without being told to, but you can write a frontend function that does this for you. You just have to remember to call yours instead.
Note that patterns in Java before v1.7 cannot be made to work according to the way UTS#18 on Unicode Regular Expressions says they must. Because of this, you leave yourself open to a wide range of bugs, infelicities, and paradoxes if you do not use the new Unicode flag. For example, the trivial and common pattern \b\w+\b will not be found to match anywhere at all within the string "élève", let alone in its entirety.
Therefore, if you are using patterns in pre-1.7 Java, you need to be extremely careful, far more careful than anyone ever is. You cannot use any of the POSIX charclasses or charclass shortcuts, including \w, \s, and \b, all of which break on anything but stone-age ASCII data. They cannot be used on Java’s native character set.
In Java 7, they can — but only with the right flag.
It is possible to refrase the description of needed regex to "contains at least one number" so the followind would work /.*[\pN].*/. Or, if you would like to limit your search to letters numbers and punctuation you shoud use /[\pL\pN\pP]*[\pN][\pL\pN\pP]*/. I've tested it on your examples and it works fine.
You can further refine your regexp by using lazy quantifiers like this /.*?[\pN].*?/. This way it would fail faster if there are no numbers.
I would like to recomend you a great book on regular expressions: Mastering regular expressions, it has a great introduction, in depth explanation of how regular expressions work and a chapter on regular expressions in java.
It looks like you just want to make sure that there are no spaces in the string. If so, you can this very simply:
return str.indexOf(" ") == -1;
This will return true if there are no spaces (valid by my understanding of your rules), and false if there is a space anywhere in the string (invalid).
Here is a partial answer, which does 0-9 and special characters OR 0-9.
^([\d]+|[\\/\-_]*)*$
This can be read as ((1 or more digits) OR (0 or more special char \ / - '_')) 0 or more times. It requires a digit, will take digits only, and will reject strings consisting of only special characters.
I used regex tester to test several of the strings.
Adding alphabetic characters seems easy, but a repetition of the given regexp may be required.

How to add features missing from the Java regex implementation?

I'm new to Java. As a .Net developer, I'm very much used to the Regex class in .Net. The Java implementation of Regex (Regular Expressions) is not bad but it's missing some key features.
I wanted to create my own helper class for Java but I thought maybe there is already one available. So is there any free and easy-to-use product available for Regex in Java or should I create one myself?
If I would write my own class, where do you think I should share it for the others to use it?
[Edit]
There were complaints that I wasn't addressing the problem with the current Regex class. I'll try to clarify my question.
In .Net the usage of a regular expression is easier than in Java. Since both languages are object oriented and very similar in many aspects, I expect to have a similar experience with using regex in both languages. Unfortunately that's not the case.
Here's a little code compared in Java and C#. The first is C# and the second is Java:
In C#:
string source = "The colour of my bag matches the color of my shirt!";
string pattern = "colou?r";
foreach(Match match in Regex.Matches(source, pattern))
{
Console.WriteLine(match.Value);
}
In Java:
String source = "The colour of my bag matches the color of my shirt!";
String pattern = "colou?r";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);
while(m.find())
{
System.out.println(source.substring(m.start(), m.end()));
}
I tried to be fair to both languages in the sample code above.
The first thing you notice here is the .Value member of the Match class (compared to using .start() and .end() in Java).
Why should I create two objects when I can call a static function like Regex.Matches or Regex.Match, etc.?
In more advanced usages, the difference shows itself much more. Look at the method Groups, dictionary length, Capture, Index, Length, Success, etc. These are all very necessary features that in my opinion should be available for Java too.
Of course all of these features can be manually added by a custom proxy (helper) class. This is main reason why I asked this question. We don't have the breeze of Regex in Perl but at least we can use the .Net approach to Regex which I think is very cleverly designed.
From your edited example, I can now see what you would like. And you have my sympathies in this, too. Java’s regexes are a long, long, long ways from the convenience you find in Ruby or Perl. And they pretty much always will be; this cannot be fixed, so we’re stuck with this mess forever — at least in Java. Other JVM languages do a better job at this, especially Groovy. But they still suffer some of the inherent flaws, and can only go so far.
Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be ok in small programs, depending how you use them. However, they do indeed have several problems, which it appears you have discovered. Here’s a partial list of those problems, and what can and cannot be done about them.
The inconvenience method is very bizarrely named “matches” but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can’t just copy your patterns around, even within methods in the same darned class for goodness’ sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.
There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you have to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.
The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufacturers a fake return element where there should be one, which means you can’t distinguish between legit results and bogus ones. It is a serious design flaw that splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don’t people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It’s not ok to break broken things in Java the wayt it is anywhere else. Broken is forever here.
The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it’s too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it’s always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn’t, you probably haven’t gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don’t go nuts. Here’s a fair collection of Groovy regex examples showing how simple it can and should be.
The (?x) mode is deeply flawed. It doesn’t take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn’t work with multiline strings. It doesn’t accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!
It is incredibly difficult — and indeed, fundamentally unfixably broken — to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you’re stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with java -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!
Many of the regex things we’ve come to rely on in other languages are missing from Java. There are no named groups for examples, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don’t have relatively numbered buffers, either. We’re back to the Bad Old Days again, stuff that was solved aeons ago.
There is no support a linebreak sequence, which is one of the only two “Strongly Recommended” parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java’s lack of support for graphemes.
The character class escapes do not work on Java’s native character set! Yes, that’s right: routine stuff like \w and \s (or rather, "\\w" and "\\b") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java’s \b (make that "\\b", which isn’t the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Pattern.find. This is just so screwed up as to beggar belief. They’ve broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn’t even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.
The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standards Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don’t even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the mininum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.
Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode propoperty names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw, since you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break because of their NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace, but Java ignores that requirement, so breaks your parser.
There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}]*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme supports varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.
The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time — which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java’s unpleasant after-the-fact, bolted-on-the-side model — and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which might yourself build what you need: you can’t get there from here.
The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can’t create a better version of a matches method by subclassing and replacement. Heck, you can’t even subclass! Final is not a solution; final is a death sentence from which there is no appeal.
Finally, to show you just how brain-damaged Java’s truly regexes are, consider this multiline pattern, which shows many of the flaws already described:
String rx =
"(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + \$)\n"
+ " # next is a big can't-have set \n"
+ "(?! ^ .* \n"
+ " (?: ^ \\d+ $ \n"
+ " | ^ \\p{Lu} - \\p{Lu} $ \n"
+ " | Invitrogen \n"
+ " | Clontech \n"
+ " | L-L-X-X # dashes ok \n"
+ " | Sarstedt \n"
+ " | Roche \n"
+ " | Beckman \n"
+ " | Bayer \n"
+ " ) # end alternatives \n"
+ " \\b # only on a word boundary \n"
+ ") # end negated lookahead \n"
;
Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don’t work right on Unicode. There are many more problems beyond that.
Not only are there no plans to fix almost any of these grievous flaws, it is indeed impossible to fix almost any of them at all, because you change old programs. Even the normal tools of OO design are forbidden to you because it’s all locked down with the finality of a death sentence, and it cannot be fixed.
So Alireza Noori, if you feel Java’s clumsy regexes are too hosed for reliable and convenient regex processing ever to be possible in Java, I cannot gainsay you. Sorry, but that’s just the way it is.
“Fixed in the Next Release!”
Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of which are already fixed in current JDK7 or proposed JDK8 builds:
The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.
The UTF-16 bug has a workaround. You may now specify any Unicode code point by its number using the \x{⋯} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You still must double backslash it though, and it only works in regexex, not strings in general as it really ought to.
Named groups are now supported via the standard notation (?<NAME>⋯) to create it and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot get at more than one of them in the same pattern, nor can you use them for recursion.
A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASSES and associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they now conform to the definitions of those things required by The Unicode Standard.
The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} will now be supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expresions”, without which you really cannot work with Unicode.
These enhancements and fixes are very important to finally have, and so I am glad, even excited, to have them.
But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There’s just too much missing from Java’s still-patchy-after-20-years Unicode model to get real work done if you dare to use the character set that Java gives. And the bolted-on-the-side model never works, which is all Java regexes are. You have to start over from first principles, the way Groovy did.
Sure, it might work for very limited applications whose small customer base is limited to English-language monoglots rural Iowa with no external interactions or any need for characters beyond what an old-style telegraph could send. But for how many projects is that really true? Fewer even that you think, it turns out.
It is for this reason that a certain (and obvious) multi-billion-dollar just recently cancelled international deployment of an important application. Java’s Unicode support — not just in regexes, but throughout — proved to be too weak for the needed internationalization to be done reliably in Java. Because of this, they have been forced to scale back from their originally planned wordwide deployment to a merely U.S. deployment. It’s positively parochial. And no, there are Nᴏᴛ Hᴀᴘᴘʏ; would you be?
Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I wouldn’t hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you too may get stuck without any real options once you’re too far into it to rescue your project.
Caveat Emptor
One can rant, or one can simply write:
public class Regex {
/**
* #param source
* the string to scan
* #param pattern
* the regular expression to scan for
* #return the matched
*/
public static Iterable<String> matches(final String source, final String pattern) {
final Pattern p = Pattern.compile(pattern);
final Matcher m = p.matcher(source);
return new Iterable<String>() {
#Override
public Iterator<String> iterator() {
return new Iterator<String>() {
#Override
public boolean hasNext() {
return m.find();
}
#Override
public String next() {
return source.substring(m.start(), m.end());
}
#Override
public void remove() {
throw new UnsupportedOperationException();
}
};
}
};
}
}
Used as you wish:
public class RegexTest {
#Test
public void test() {
String source = "The colour of my bag matches the color of my shirt!";
String pattern = "colou?r";
for (String match : Regex.matches(source, pattern)) {
System.out.println(match);
}
}
}
Some of the API flaws mentioned in #tchrist's answer were fixed in Kotlin.
Boy, do I hear you on that one Alireza! Regex's are confusing enough without there being so many syntax variations amonng them. I too do a lot more C# than Java programming and had the same issue.
I found this to be very helpful:
http://www.tusker.org/regex/regex_benchmark.html
- it's a list of alternative regular expression implementations for Java, benchmarked.
This one is darned good, if I do say so myself!
regex-tester-tool

Regex to find variables and ignore methods

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The actual code (the one which executes regex) is written in Java.
For now, I've got something like this:
Matcher matcher=Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
So, when value of "string" is variable*func()*20
printout is:
variable
func
Which is not what I want. The simple negation of ( won't do, because it makes regex catch unnecessary characters or cuts them off, but still functions are captured. For now, I have the following code:
Matcher matcher=Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while(matcher.find()) {
if(matcher.group(3).isEmpty()) {
System.out.println(matcher.group(2));
}
}
It works, the printout is correct, but I don't like the additional check. Any ideas? Please?
EDIT (2011-04-12):
Thank you for all answers. There were questions, why would I need something like that. And you are right, in case of bigger, more complicated scripts, the only sane solution would be parsing them. In my case, however, this would be excessive. The scraps of JS I'm working on are intented to be simple formulas, something like (a+b)/2. No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need variables list to check if they can be initalized and this point (and initialized at all). I realize that all of it can be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped with bigger script and evaluated in web browser, so it's more convenient this way.
This may be a bit dirty, but it's assumed that whoever is writing these formulas (probably me, for most of the time), knows what is doing and is able to check if they are working correctly.
If anyone finds this question, wanting to do something similar, should now the risks/difficulties. I do, at least I hope so ;)
Taking all the sound advice about how regex is not the best tool for the job into consideration is important. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):
Pattern regex = Pattern.compile(
"\\b # word boundary\n" +
"[A-Za-z]# 1 ASCII letter\n" +
"\\w* # 0+ alnums\n" +
"\\b # word boundary\n" +
"(?! # Lookahead assertion: Make sure there is no...\n" +
" \\s* # optional whitespace\n" +
" \\( # opening parenthesis\n" +
") # ...at this position in the string",
Pattern.COMMENTS);
This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1). And of course this matches lots of other stuff (inside strings, comments, etc.)...
If you are rethinking using regex and wondering what else you could do, you could consider using an AST instead to access your source programatically. This answer shows you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do similar for Javascript.
A regex won't cut in this case because Java isn't regular. Your best best is to get a parser that understands Java syntax and build onto that. Luckily, ANTLR has a Java 1.6 grammar (and 1.5 grammar).
For your rather limited use case you could probably easily extend the variable assignment rules and get the info you need. It's a bit of a learning curve but this will probably be your best best for a quick and accurate solution.
It's pretty well established that regex cannot be reliably used to parse structured input. See here for the famous response: RegEx match open tags except XHTML self-contained tags
As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).

How do I convert a regular expression in a valid .NET format to valid Java format?

I want to use the following regular expression which is written within a C# .NET code, in a Java code, but I can't seem to convert it right, can you help me out?
Regex(#"\w+:\/\/(?<Domain>[\x21-\x22\x24-\x2E\x30-\x3A\x40-\x5A\x5F\x61-\x7A]+)(?<Relative>/?\S*)", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
Java does not have the # string notation. So, make sure you escape all the '\' in your regexp. (\w+ becomes> \\w+, \/ becomes> \\/, \x21 becomes> \\x21, etc. )
The most direct translation would be:
Pattern p = Pattern.compile(
"\\w+://([\\x21-\\x22\\x24-\\x2E\\x30-\\x3A\\x40-\\x5A\\x5F\\x61-\\x7A]+)(/?\\S*)",
Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Java has no equivalent for C#'s verbatim strings, so you always have to escape backslashes. And Java's regexes don't support named groups, so I converted those to simple capturing groups (named groups are due to be added in Java 7).
But there are a few problems with the original regex:
The RegexOptions.Compiled modifier doesn't do what you probably think it does. Specifically, it's not related to Java's compile() method; that's just a factory method, roughly equivalent to C#'s new Regex() constructor. The Compiled modifier causes the regex to be compiled to CIL bytecode, which can make it match a lot faster, but at a considerable cost in upfront processing and memory use--and that memory never gets garbage-collected. If you don't use the regex a lot, the Compiled option is probably doing more harm than good, performance-wise.
The IgnoreCase/CASE_INSENSITIVE modifier is pointless since your regex always matches both upper- and lowercase variants wherever it matches letters.
The Singleline/DOTALL modifier is pointless since you never use the dot metacharacter.
In .NET regexes, the character-class shorthand \w is Unicode-aware, equivalent to [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. In Java it's ASCII-only -- [A-Za-z0-9_]-- which seems to be more in line with the way you're using it (you could "dumb it down" in .NET by using the RegexOptions.ECMAScript modifier).
So the actual translation would be more like this:
Pattern p = Pattern.compile("\\w+://([\\w!\"$.:#]+)(?:/(\\S*))?");
Named groups are done differently in .NET than in all the other Regex flavors. You have:
(?<Domain>pattern)
Java (and everyone else) expects:
(?P<Domain>pattern)

Categories