How would you translate this Perl regex into Java?
/pattern/i
While it compiles, it does not match "PattErn" for me; it fails:
Pattern p = Pattern.compile("/pattern/i");
Matcher m = p.matcher("PattErn");
System.out.println(m.matches()); // prints "false"
How would you translate this Perl regex into Java?
/pattern/i
You can't.
There are a lot of reasons for this. Here are a few:
Java doesn't support as expressive a regex language as Perl does. It lacks grapheme support (like \X) and full property support (like \p{Sentence_Break=SContinue}), is missing Unicode named characters, doesn't have a (?|...|...|) branch reset operator, doesn’t have named capture groups or a logical \x{...} escape before Java 7, has no recursive regexes, etc etc etc. I could write a book on what Java is missing here: Get used to going back to a very primitive and awkward to use regex engine compared with what you’re used to.
Another, even worse, problem is that you have lookalike faux amis like \w and \b and \s, and even \p{alpha} and \p{lower}, which behave differently in Java compared with Perl; in some cases the Java versions are completely unusable and buggy. That’s because Perl follows UTS#18 but before Java 7, Java did not. You must add the UNICODE_CHARACTER_CLASS flag from Java 7 to get these to stop being broken. If you can’t use Java 7, give up now, because Java had many many many other Unicode bugs before Java 7 and it just isn’t worth the pain of dealing with them.
Java handles linebreaks via ^ and $ and ., but Perl expects Unicode linebreaks to be \R. You should look at UNIX_LINES to understand what is going on there.
Java does not by default apply any Unicode casefolding whatsoever. Make sure to add the UNICODE_CASE flag to your compilation. Otherwise you won’t get things like the various Greek sigmas all matching one another.
Finally, it is different because at best Java only does simple casefolding, while Perl always does full casefolding. That means that you won’t get \xDF to match "SS" case insensitively in Java, and similar related issues.
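A minimal sketch of the two case-folding points above. The class and method names are hypothetical; only the Pattern flags are real API. The characters are written as Unicode escapes so the example does not depend on source-file encoding.

```java
import java.util.regex.Pattern;

class CaseFoldingDemo {
    // CASE_INSENSITIVE on its own only folds US-ASCII letters.
    static boolean asciiOnly(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                      .matcher(input).matches();
    }

    // Adding UNICODE_CASE switches on (simple) Unicode case folding.
    static boolean unicodeAware(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                      .matcher(input).matches();
    }

    public static void main(String[] args) {
        String smallSigma = "\u03c3";   // Greek small letter sigma
        String capitalSigma = "\u03a3"; // Greek capital letter sigma
        String sharpS = "\u00df";       // German sharp s

        System.out.println(asciiOnly(smallSigma, capitalSigma));    // false
        System.out.println(unicodeAware(smallSigma, capitalSigma)); // true
        // Simple folding is one character to one character, so the
        // single sharp s cannot match the two-character "SS".
        System.out.println(unicodeAware(sharpS, "SS"));             // false
    }
}
```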
In summary, the closest you can get is to compile with the flags
CASE_INSENSITIVE | UNICODE_CASE | UNICODE_CHARACTER_CLASS
which is equivalent to an embedded "(?iuU)" in the pattern string.
And remember that match in Java doesn’t mean match, perversely enough.
EDIT
And here’s the rest of the story...
While it compiles, it does not match "PattErn" for me; it fails:
Pattern p = Pattern.compile("/pattern/i");
Matcher m = p.matcher("PattErn");
System.out.println(m.matches()); // prints "false"
You shouldn’t have slashes around the pattern.
The best you can do is to translate
$line = "I have your PaTTerN right here";
if ($line =~ /pattern/i) {
print "matched.\n";
}
this way
import java.util.regex.*;
String line = "I have your PaTTerN right here";
String pattern = "pattern";
Pattern regcomp = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE
| Pattern.UNICODE_CASE
// comment next line out for legacy Java \b\w\s breakage
| Pattern.UNICODE_CHARACTER_CLASS
);
Matcher regexec = regcomp.matcher(line);
if (regexec.find()) {
System.out.println("matched");
}
There, see how much easier that isn’t? :)
Java regexes do not have delimiters, and use a separate argument for modifiers:
Pattern p = Pattern.compile("pattern", Pattern.CASE_INSENSITIVE);
The Perl equivalent of:
/pattern/i
in Java would be:
Pattern p = Pattern.compile("(?i)pattern");
Or simply do:
System.out.println("PattErn".matches("(?i)pattern"));
Note that "string".matches("pattern") validates the pattern against the entire input string. In other words, the following would return false:
"foo pattern bar".matches("pattern")
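To make the whole-string versus substring distinction concrete, here is a small sketch. The class and method names are made up for illustration; matches() and find() are the standard Matcher methods.

```java
import java.util.regex.Pattern;

class WholeVersusPartial {
    // Same pattern as above, with the case-insensitive flag embedded.
    private static final Pattern WORD = Pattern.compile("(?i)pattern");

    // True only if the entire input is the word.
    static boolean isExactly(String input) {
        return WORD.matcher(input).matches();
    }

    // True if the word occurs anywhere in the input.
    static boolean contains(String input) {
        return WORD.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(isExactly("PattErn"));         // true
        System.out.println(isExactly("foo pattern bar")); // false
        System.out.println(contains("foo pattern bar"));  // true
    }
}
```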
I'm trying to extract a string from a String in Regex Java
Pattern pattern = Pattern.compile("((.|\\n)*).{4}InsurerId>\\S*.{5}InsurerId>((.|\\n)*)");
Matcher matcher = pattern.matcher(abc);
I'm trying to extract the value between
<_1:InsurerId>F2021633_V1</_1:InsurerId>
I'm not sure where am I going wrong but I don't get output for
if (matcher.find())
{
System.out.println(matcher.group(1));
}
You can use:
Pattern pattern = Pattern.compile("<([^:]+:InsurerId)>([^<]*)</\\1>");
Matcher matcher = pattern.matcher(abc);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
You may want to use the totally awesome page http://regex101.com/ to test your regular expressions. As you can see at https://regex101.com/r/rV8uM3/1, you only have empty capturing groups, but let me explain to you what you did. :D
((.|\n)*) This matches any character, or a newline, any number of times. It is capturing, so your first group will always be everything before <_1:InsurerId>, or an empty string. If you really need to match across newlines, .* with the DOTALL flag does the same job more cheaply. You can even leave it out entirely, as it isn't part of the string you want to match; keeping it will actually be a problem if you have multiple InsurerIds in your file and want to get them all.
.{4}InsurerId> This matches "InsurerId>" with any four characters in front of it and is exactly what you want. As the first character is probably always an opening angle bracket (and you don't want stuff like "<ExampleInsurerId>"), I'd suggest using <.{3}InsurerId> instead. This still could have some problems (<Test id="<" xInsurerId>), so if you know exactly that it's "_<a digit>:", why not use <_\d:InsurerId>?
\S* matches everything except for whitespaces - probably not the best idea as XML and similar files can be written to not contain any space at all. You want to have everything to the next tag, so use [^<]* - this matches everything except for an opening angle bracket. You also want to get this value later, so you have to use a capturing group: ([^<]*)
.{5}InsurerId> The same thing here: use <\/.{3}InsurerId> or <\/_\d:InsurerId> (forward slashes are actually characters interpreted by other RegEx implementations, so I suggest escaping them)
((.|\n)*) Again the same thing: just leave it out.
The resulting Regular Expression would then be the following:
<_\d:InsurerId>([^<]*)<\/_\d:InsurerId>
And as you can see at https://regex101.com/r/mU6zZ3/1 - you have exactly one match, and it's even "F2021633_V1" :D
For Java, you have to escape the backslashes, so the resulting code would look like this:
Pattern pattern = Pattern.compile("<_\\d:InsurerId>([^<]*)<\\/_\\d:InsurerId>");
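As a rough sketch of how that pattern could be used when the input holds several InsurerId elements, the loop below calls find() repeatedly and collects group 1. The class name, method name, and sample input are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsurerIdExtractor {
    private static final Pattern INSURER_ID =
            Pattern.compile("<_\\d:InsurerId>([^<]*)</_\\d:InsurerId>");

    // Returns the text of every InsurerId element, in document order.
    static List<String> extractAll(String xml) {
        List<String> ids = new ArrayList<>();
        Matcher matcher = INSURER_ID.matcher(xml);
        while (matcher.find()) {
            ids.add(matcher.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<a><_1:InsurerId>F2021633_V1</_1:InsurerId>\n"
                   + "<_2:InsurerId>X9</_2:InsurerId></a>";
        System.out.println(extractAll(xml)); // [F2021633_V1, X9]
    }
}
```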
If you are using Java 7 or above, you can use named groups to make the regex a little more readable (also note the \k backreference, which makes the closing tag match the opening tag):
Pattern pattern = Pattern.compile("(?:<(?<InsurancePrefix>.+)InsurerId>)(?<id>[A-Z0-9_]+)</\\k<InsurancePrefix>InsurerId>");
Matcher matcher = pattern.matcher("<_1:InsurerId>F2021633_V1</_1:InsurerId>");
if (matcher.matches()) {
System.out.println(matcher.group("id"));
}
Because of the backreference, matches() fails on text whose closing tag differs from the opening tag, for example
<_1:InsurerId>F2021633_V1</_2:InsurerId>
which is the correct behavior.
Javadoc has a good explanation: https://docs.oracle.com/javase/8/docs/api/
Also you might consider using a different tool (XML parser) instead of Regex, as well, as other people have to support your code, and complex Regex is usually difficult to understand.
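For comparison, a minimal sketch of the XML-parser route using the JDK's built-in DOM API. It assumes the input is a well-formed document and that the parser is left in its default, non-namespace-aware mode, so that the undeclared _1 prefix is treated as part of the tag name. The class name, method name, and wrapper element are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InsurerIdXml {
    // Returns the text of the first element with the given tag name,
    // or null if there is none.
    static String firstText(String xml, String tagName) {
        try {
            // Assumption: default factory settings (not namespace-aware).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><_1:InsurerId>F2021633_V1</_1:InsurerId></root>";
        System.out.println(firstText(xml, "_1:InsurerId"));
    }
}
```

If the real documents declare their namespaces, a namespace-aware parser with getElementsByTagNameNS would be the more robust choice.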
I have a working regex in Python and I am trying to convert to Java. It seems that there is a subtle difference in the implementations.
The RegEx is trying to match another reg ex. The RegEx in question is:
/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)
One of the strings that it is having problems on is: /\s+/;
The regex is not supposed to match the ending ;. In Python the regex works correctly (and does not match the ending ;), but in Java it does include the ;.
The Question(s):
What can I do to get this RegEx working in Java?
Based on what I read here there should be no difference for this RegEx. Is there somewhere a list of differences between the RegEx implementations in Python vs Java?
Java doesn't parse Regular Expressions in the same way as Python for a small set of cases. In this particular case the nested ['s were causing problems. In Python you don't need to escape any nested [ but you do need to do that in Java.
The original RegEx (for Python):
/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)
The fixed RegEx (for Java and Python):
/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)
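Here is a sketch of the fixed regex as a Java string literal, where every regex backslash has to be doubled. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLiteralMatcher {
    // The fixed regex from above, with each backslash doubled for Java.
    private static final Pattern JS_REGEX_LITERAL = Pattern.compile(
            "/(\\\\.|[^\\[/\\\\\\n]|\\[(\\\\.|[^\\]\\\\\\n])*\\])+/([gim]+\\b|\\B)");

    // Returns the first regex literal found in the source, or null.
    static String firstLiteral(String source) {
        Matcher m = JS_REGEX_LITERAL.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The Java literal "/\\s+/;" is the six characters /\s+/;
        System.out.println(firstLiteral("/\\s+/;"));   // prints /\s+/
        System.out.println(firstLiteral("/[/]+/g;"));  // prints /[/]+/g
    }
}
```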
The obvious difference between Java and Python is that in Java you need to escape a lot of characters.
Moreover, you are probably running into a mismatch between the matching methods, not a difference in the actual regex notation:
Given the Java
String regex, input; // initialized to something
Matcher matcher = Pattern.compile( regex ).matcher( input );
Java's matcher.matches() (also Pattern.matches( regex, input )) matches the entire string. Its direct equivalent in Python is re.fullmatch( regex, input ), available since Python 3.4; in older versions the same result can be achieved by using re.match( regex, input ) with a regex that ends with $.
Java's matcher.find() and Python's re.search( regex, input ) match any part of the string.
Java's matcher.lookingAt() and Python's re.match( regex, input ) match the beginning of the string.
For more details also read Java's documentation of Matcher and compare to the Python documentation.
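The three Java methods can be compared side by side with a small sketch like the following. The class and helper names are hypothetical; matches(), lookingAt(), and find() are the real Matcher methods.

```java
import java.util.regex.Pattern;

class MatchModes {
    // Whole input must match (Java matches()).
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Match must start at the beginning (Java lookingAt(), Python re.match).
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // Match may be anywhere (Java find(), Python re.search).
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("b+", "bbb"));      // true
        System.out.println(whole("b+", "bbc"));      // false
        System.out.println(prefix("b+", "bbc"));     // true
        System.out.println(prefix("b+", "abbc"));    // false
        System.out.println(anywhere("b+", "abbc"));  // true
    }
}
```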
Since you said that isn't the problem, I decided to do a test: http://ideone.com/6w61T
It looks like Java is doing exactly what you need it to (group 0, the entire match, doesn't contain the ;). Your problem is elsewhere.
I have an input field which is localized. I need to add validation using a regex so that it accepts only letters and digits. I could have used [a-z0-9] if I were using only English.
As of now, I am using the method Character.isLetterOrDigit(name.charAt(i)) (yes, I am iterating through each character) to filter out the letters present in various languages.
Are there any better ways of doing it? Any regex or other libraries available for this?
Since Java 7 you can use Pattern.UNICODE_CHARACTER_CLASS
String s = "Müller";
Pattern p = Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group());
} else {
System.out.println("not found");
}
Without the option, the pattern does not recognize the word "Müller". With Pattern.UNICODE_CHARACTER_CLASS it does, because that flag, per the Javadoc,
"Enables the Unicode version of Predefined character classes and POSIX character classes."
See here for more details
You can also have a look here for more Unicode information in Java 7.
and here on regular-expressions.info an overview of the Unicode scripts, properties and blocks.
See here a famous answer from tchrist about the caveats of regex in Java, including an update on what has changed with Java 7 (or will be in Java 8)
boolean foundMatch = name.matches("[\\p{L}\\p{Nd}]*");
should work.
[\p{L}\p{Nd}] matches a character that is either a Unicode letter or digit. The regex .matches() method ensures that the entire string matches the pattern.
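A minimal sketch of that check with the pattern compiled once and reused. The class and method names are hypothetical, and it uses + rather than * on the assumption that an empty field should be rejected; switch back to * if empty input is acceptable.

```java
import java.util.regex.Pattern;

class LetterDigitValidator {
    // Unicode letters (L) and decimal digits (Nd) only; at least one character.
    private static final Pattern LETTERS_AND_DIGITS =
            Pattern.compile("[\\p{L}\\p{Nd}]+");

    static boolean isValid(String name) {
        return LETTERS_AND_DIGITS.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("M\u00fcller"));       // true
        System.out.println(isValid("\u0661\u0662\u0663")); // true (Arabic-Indic digits)
        System.out.println(isValid("M\u00fcller!"));      // false
    }
}
```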
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
-- Jamie Zawinski
I say this in jest, but iterating through the String like you are doing will have runtime performance at least as good as any regex — there's no way a regex can do what you want any faster; and you don't have the overhead of compiling a pattern in the first place.
So as long as:
the validation doesn't need to do anything else regex-like (nothing was mentioned in the question)
the intention of the code looping through the String is clear (and if not, refactor until it is)
then why replace it with a regex just because you can?
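If the loop stays, one possible refactoring is to iterate over code points instead of chars, so that letters outside the Basic Multilingual Plane (which occupy two chars) are classified correctly. This is only a sketch: the class and method names are made up, and it assumes Java 8 or later for String.codePoints().

```java
class LetterDigitLoop {
    // True if every code point is a letter or digit.
    // Note: an empty string returns true, like the original loop would.
    static boolean isLettersAndDigits(String name) {
        return name.codePoints().allMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        System.out.println(isLettersAndDigits("M\u00fcller42")); // true
        System.out.println(isLettersAndDigits("a b"));           // false
        // U+1D400 (mathematical bold capital A) is stored as a surrogate pair.
        System.out.println(isLettersAndDigits("\uD835\uDC00"));  // true
    }
}
```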
I am trying to extract the information inside of these tags along the lines of
hello=barry0238293<
hello=terry2938298<
hello=bruce8392382<
The expression I have written is
hello=(.*)<
I thought that this would have worked but it doesn't.
Could you point me in the right direction if I am doing this completely wrong?
(.*)< is not really a good regular expression. The star quantifier is greedy and it will consume all input, but then the regular expression engine will notice that there's something after it, and it will begin to backtrack until it finds the following text (the less-than sign in this case). This can lead to serious performance hits. For example, I had one of these in some code (I was being lazy -- bad programmer!), and it was taking something like 1100+ milliseconds to execute on a very small input string.
A better expression would be something like this: "hello=([^<]*)<". The brackets [] form a character class, but with the caret ^ as the first entry in the character class, it negates the class, i.e. it's saying find characters that are not in the following set. Then you add the terminating character < and the regex engine will seek until it finds the less-than sign without having to backtrack.
I hacked out a quick example of using the raw Java regex classes in Clojure to be sure that my regex works. I ignored the built-in regex support in Clojure to show that it works with the regular Java API to make sure that aspect of it is clear. (This is not a good example of how to do regular expressions in Clojure.) I added comments (they follow the ;; in the example) that translate to Java, but it should be pretty clear what's going on if you know the regex APIs.
;; create a pattern object
user=> (def p (java.util.regex.Pattern/compile "hello=([^<]*)<"))
#'user/p
;; create a matcher for the string
user=> (def m (.matcher p "hello=bruce8392382<"))
#'user/m
;; call m.matches()
user=> (.matches m)
true
;; call m.group(1)
user=> (.group m 1)
"bruce8392382"
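For readers who would rather see it in plain Java, a rough equivalent of the REPL session might look like this. The class and method names are invented; it uses find() in a loop so that all three sample lines from the question can be handled in one pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HelloExtractor {
    private static final Pattern HELLO = Pattern.compile("hello=([^<]*)<");

    // Collects whatever sits between each "hello=" and the next '<'.
    static List<String> extract(String text) {
        List<String> values = new ArrayList<>();
        Matcher m = HELLO.matcher(text);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String text = "hello=barry0238293<\nhello=terry2938298<\nhello=bruce8392382<";
        System.out.println(extract(text));
        // [barry0238293, terry2938298, bruce8392382]
    }
}
```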
I believe this should be close: /hello\=(\w*)\</
Neither '=' nor '<' is a metacharacter on its own, so the '\' before them is harmless but not required. '\w' matches [a-zA-Z0-9_], but if you needed separation between the name and number you can replace it with something like ([a-zA-Z]+\d+).
(.*) is risky because it's greedy: if the input contains more than one '<', it will run on to the last one rather than stopping at the first. You may need to tweak this further, but it should help you get started.
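To see how greediness plays out when the input contains more than one '<', the sketch below runs three variants of the capture group over the same string. The class name, helper, and sample input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVersusNegated {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "hello=barry<hello=terry<";
        // Greedy: runs to the last '<' in the input.
        System.out.println(firstCapture("hello=(.*)<", input));    // barry<hello=terry
        // Reluctant: stops at the first '<'.
        System.out.println(firstCapture("hello=(.*?)<", input));   // barry
        // Negated class: cannot cross a '<' at all.
        System.out.println(firstCapture("hello=([^<]*)<", input)); // barry
    }
}
```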
This works:
Pattern p = Pattern.compile("hello=(.*)<");
Matcher m = p.matcher("hello=bruce8392382<");
if (m.matches()) {
System.out.println(m.group(1));
}
I'm trying to match chunks of JS code and extract string literals that contain a given keyword using Java.
After trying to come up with my own regexp to do this, I ended up modifying this generalized string-literal matching regexp (Pattern.COMMENTS used when building the patterns in Java):
(["'])
(?:\\?+.)*?
\1
to the following
(["'])
(?:\\?+.)*?
keyword
(?:\\?+.)*?
\1
The test cases:
var v1 = "test";
var v2 = "testkeyword";
var v3 = "test"; var v4 = "testkeyword";
The regexp correctly doesn't match line 1 and correctly matches line 2.
However, in line 3, instead of just matching "testkeyword", it matches the chunk
"test"; var v4 = "testkeyword"
which is wrong - the regexp matched the first double quote and did not terminate at the second double quote, going all the way till the end of line.
Does anyone have any ideas on how to fix this?
PS: Please keep in mind that the Regexp has to correctly handle escaped single and double quote characters inside of string literals (which the generalized matcher already did).
How about this modification:
(?:
"
(?:\\"|[^"\r\n])*
keyword
(?:\\"|[^"\r\n])*
"
|
'
(?:\\'|[^'\r\n])*
keyword
(?:\\'|[^'\r\n])*
'
)
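A sketch of how this pattern could be exercised from Java against the three test lines. The class and method names are hypothetical, and the pattern is assembled on one line rather than with Pattern.COMMENTS. It inherits the pattern's limits: only backslash-quote is treated as an escape, and a match can in principle start at a closing quote.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordLiteralFinder {
    // Double-quoted and single-quoted branches of the pattern above.
    private static final String DQ =
            "\"(?:\\\\\"|[^\"\\r\\n])*keyword(?:\\\\\"|[^\"\\r\\n])*\"";
    private static final String SQ =
            "'(?:\\\\'|[^'\\r\\n])*keyword(?:\\\\'|[^'\\r\\n])*'";
    private static final Pattern LITERAL_WITH_KEYWORD =
            Pattern.compile("(?:" + DQ + "|" + SQ + ")");

    // Returns every quoted literal (quotes included) that contains "keyword".
    static List<String> find(String line) {
        List<String> hits = new ArrayList<>();
        Matcher m = LITERAL_WITH_KEYWORD.matcher(line);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(find("var v1 = \"test\";"));
        System.out.println(find("var v2 = \"testkeyword\";"));
        System.out.println(find("var v3 = \"test\"; var v4 = \"testkeyword\";"));
    }
}
```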
After much revision (see edit history, viewers at home :), I believe this is my final answer:
(?:
"
(?:\\.|[^"\\])*
keyword
(?:\\.|[^"\\])*
"
|
'
(?:\\.|[^'\\])*
keyword
(?:\\.|[^'\\])*
'
)
You need to write two patterns, one for single-quoted and one for double-quoted strings, because a negated character class cannot refer back to whichever quote opened the string. Then you can combine them with |.
Consider using code from Rhino -- JS in Java -- to get the real String literals.
Or, if you want to use regex, consider one find for the whole literal, then a nested test if the literal contains 'keyword'.
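The two-step idea could be sketched as follows: one regex finds each complete string literal, and a plain contains() check then tests it for the keyword. The class name, method name, and the literal-matching regex are assumptions for illustration, not code from the question. Like the other regex approaches, this does not decode escapes such as \x6b, and it knows nothing about comments or regex literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepLiteralSearch {
    // A whole literal: a quote, then escaped characters or
    // non-quote, non-backslash characters, then the same quote.
    private static final Pattern STRING_LITERAL = Pattern.compile(
            "\"(?:\\\\.|[^\"\\\\])*\"|'(?:\\\\.|[^'\\\\])*'");

    static List<String> literalsContaining(String source, String keyword) {
        List<String> hits = new ArrayList<>();
        Matcher m = STRING_LITERAL.matcher(source);
        while (m.find()) {                      // step 1: each whole literal
            if (m.group().contains(keyword)) {  // step 2: nested keyword test
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String line = "var v3 = \"test\"; var v4 = \"testkeyword\";";
        System.out.println(literalsContaining(line, "keyword"));
    }
}
```

Because find() consumes each literal in full before moving on, a closing quote is never mistaken for the start of the next literal.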
I think Tim's construction works, but I wouldn't bet on it in all situations, and the regex would have to get insanely unwieldy if it had to deal with literals that don't want to be found (as if trying to sneak by your testing). For example:
var v5 = "test\x6b\u0065yword"
Separate from any solution, my secret weapon for interactively working out regexes is a tool I made called Regex Powertoy, which unlike many such utilities runs in any browser with Java applet support.
A grammar to construct a string literal would look roughly like this:
string-literal ::= quote text quote
text ::= character text
| character
character ::= non-quote
| backslash quote
with non-quote, backslash, and quote being terminals.
A grammar is regular if it is context free (i.e. the left hand side of all rules is always a single non-terminal) and the right hand side of all rules is always either empty, a terminal, or a terminal followed by a non-terminal.
You may notice that the first rule given above has a terminal followed by a nonterminal followed by a terminal. This is thus not a regular grammar.
A regular expression is an expression that can parse regular languages (languages that can be constructed by a regular grammar). It is not possible to parse non-regular languages with regular expressions.
The difficulty you have in finding a suitable regular expression stems from the fact that a suitable regular expression doesn't exist. You will never arrive at code that is obviously correct, this way.
It is much easier to write a simple parser along the lines of the above rules. Since the text contained by your string literals is regular, you can use a simple regular expression to look for your keyword after you have extracted that text from its surroundings.
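A minimal sketch of such a hand-written scanner, under the simplifying assumptions that a backslash escapes exactly the next character and that the source contains no comments or regex literals with stray quotes. All names are hypothetical, and the keyword test uses String.contains rather than a regex for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class StringLiteralScanner {
    // Returns the body (without quotes) of every literal containing the keyword.
    static List<String> literalsContaining(String source, String keyword) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '"' || c == '\'') {
                int end = findClosingQuote(source, i);
                if (end < 0) {
                    break; // unterminated literal: stop scanning
                }
                String body = source.substring(i + 1, end);
                if (body.contains(keyword)) {
                    hits.add(body);
                }
                i = end + 1;
            } else {
                i++;
            }
        }
        return hits;
    }

    // Index of the quote that closes the literal opened at 'open', or -1.
    private static int findClosingQuote(String s, int open) {
        char quote = s.charAt(open);
        for (int j = open + 1; j < s.length(); j++) {
            char c = s.charAt(j);
            if (c == '\\') {
                j++; // skip the escaped character
            } else if (c == quote) {
                return j;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String line = "var v3 = \"test\"; var v4 = \"testkeyword\";";
        System.out.println(literalsContaining(line, "keyword")); // [testkeyword]
    }
}
```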