Regex when pattern involves dollar sign ($) - java

I'm running into a bit of an issue when it comes to matching sub-patterns that involve the dollar sign. For example, consider the following chunk of text:
(en $) foo
oof ($).
ofo (env. 80 $US)
I'm using the following regex:
Pattern p = Pattern.compile(
        "\\([\\p{InARABIC}\\s]+\\)|\\([\\p{InBasic_Latin}\\s?\\$]+\\)|\\)([\\p{InARABIC}\\s]+)\\(",
        Pattern.CASE_INSENSITIVE);

public String replace(String text) {
    Matcher m = p.matcher(text);
    String replacement = m.replaceAll(match -> {
        if (m.group(1) == null) {
            return m.group();
        } else {
            return "(" + match.group(1) + ")";
        }
    });
    return replacement;
}
but it can't handle text containing $.

This code behaves much like replaceAll(regex, replacement). The problem is that $ is special not only in the regex argument but also in the replacement, where it references a matched group: $x (where x is the group number) or ${groupName} if your regex contains (?<groupName>subregex).
This allows us to write code like
String doubled = "abc".replaceAll(".", "$0$0");
System.out.println(doubled); //prints: aabbcc
which replaces each character with two copies of itself: each character is matched by . and stored in group 0, so $0$0 stands for that matched character twice.
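The same mechanism works with named groups. The following is only a small sketch: the class name, method name, regex and sample date are made up for illustration and are not part of the question.

```java
// Hypothetical illustration: class, method and sample data are assumptions.
class GroupReferenceDemo {

    // Turns "yyyy-MM" into "MM/yyyy" by referencing named groups in the replacement.
    static String swapYearMonth(String text) {
        return text.replaceAll("(?<year>\\d{4})-(?<month>\\d{2})", "${month}/${year}");
    }

    public static void main(String[] args) {
        System.out.println(swapYearMonth("2024-05"));      // prints: 05/2024
        System.out.println("abc".replaceAll(".", "$0$0")); // prints: aabbcc
    }
}
```

In both calls the $ in the replacement is followed by a valid group number or name, which is exactly what is missing when a bare $ from the input ends up in the replacement.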
But your text contains $, so when it is matched you replace it with itself. The replacement then contains a $ with no group number or group name after it, which results in IllegalArgumentException: Illegal group reference.
The solution is to escape that $ in the replacement. You can do it manually with \, but it is better to use the method designed for that purpose, Matcher#quoteReplacement. If the replacement syntax ever gains more special characters, that method should evolve along with the regex engine, which saves you some trouble later.
So try changing your code to
public String replace(String text) {
    Matcher m = p.matcher(text);
    String replacement = m.replaceAll(match -> {
        if (match.group(1) == null) {
            return Matcher.quoteReplacement(match.group());
            //     ^^^^^^^^^^^^^^^^^^^^^^^^
        } else {
            return Matcher.quoteReplacement("(" + match.group(1) + ")");
            //     ^^^^^^^^^^^^^^^^^^^^^^^^
        }
    });
    return replacement;
}
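To see the failure and the fix side by side, here is a minimal, self-contained sketch. It assumes Java 9 or later, and it uses a deliberately simplified pattern (any parenthesized run) rather than the one from the question; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is simplified compared to the question's.
class DollarReplacementDemo {

    private static final Pattern PARENS = Pattern.compile("\\([^()]*\\)");

    // Returns each match unchanged WITHOUT quoting: throws if the match contains a bare '$'.
    static String unsafe(String text) {
        return PARENS.matcher(text).replaceAll(match -> match.group());
    }

    // Same replacement, but quoted so that '$' and '\' are taken literally.
    static String safe(String text) {
        return PARENS.matcher(text)
                .replaceAll(match -> Matcher.quoteReplacement(match.group()));
    }

    public static void main(String[] args) {
        String text = "oof ($).";
        try {
            unsafe(text);
        } catch (IllegalArgumentException e) {
            System.out.println("unsafe failed: " + e.getMessage());
        }
        System.out.println(safe(text)); // prints: oof ($).
    }
}
```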

Related

How to replace excessive SQL wildcard by single regex pattern?

I am creating a function that strips the illegal wildcard patterns from the input string. The ideal solution should use a single regex expression, if at all possible.
The illegal wildcard patterns are: %% and %_%. Each instance of those should be replaced with %.
Here's the rub... I'm doing some fuzz testing by running the function against various inputs to try to break it.
It works for the most part; however, with complicated inputs, it doesn't.
The rest of this question has been updated:
The following inputs should return empty string (not an exhaustive list):
The following inputs should return % (not an exhaustive list).
%_%
%%
%%_%%
%_%%%
%%_%_%
%%_%%%_%%%_%
There will be cases where there are other characters with the input... like:
Foo123%_%
Should return "Foo123%"
B4r$%_%
Should return "B4r$%"
B4rs%%_%
Should return "B4rs%"
%%Lorem_%%
Should return "%Lorem_%"
I have tried using several different patterns and my tests are failing.
String input = "%_%%%%_%%%_%";
// old method:
public static String ancientMethod1(String input) {
    if (input == null)
        return "";
    return input.replaceAll("%_%", "").replaceAll("%%", ""); // Output: ""
}

// Attempt 1:
// Doesn't quite work right.
// "A%%" is returned as "A%%" instead of "A%"
public static String newMethod1(String input) {
    String result = input;
    while (result.contains("%%") || result.contains("%_%"))
        result = result.replaceAll("%%","%").replaceAll("%_%","%");
    if (result.equals("%"))
        return "";
    return input;
}

// Attempt 2:
// Succeeds, but I would like to simplify this:
public static String newMethod2(String input) {
    if (input == null)
        return "";
    String illegalPattern1 = "%%";
    String illegalPattern2 = "%_%";
    String result = input;
    while (result.contains(illegalPattern1) || result.contains(illegalPattern2)) {
        result = result.replace(illegalPattern1, "%");
        result = result.replace(illegalPattern2, "%");
    }
    if (result.equals("%") || result.equals("_"))
        return "";
    return result;
}
Here's a more complete defined example of how I'm using this: https://gist.github.com/sometowngeek/697c839a1bf1c9ee58be283b1396cf2e
This regular expression string matches all your examples:
"%(?:_?%)+"
It matches strings consisting of a '%' character followed by one or more sequences consisting of zero or one '_' character and one '%' character (close to literal translation), which is another way of saying what I did in comments: "a sequence of '%' and '_' characters, beginning and ending with '%', and not containing two consecutive '_' characters".
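As a rough sketch of how that expression could be applied, the snippet below collapses every matched run into a single %. The class and method names are hypothetical, and returning an empty string for null simply mirrors the question's own methods.

```java
// Hypothetical helper; only the regex itself comes from the answer above.
class WildcardCollapseDemo {

    // Collapses every run matched by %(?:_?%)+ into a single '%'.
    static String collapse(String input) {
        if (input == null) {
            return ""; // assumption: same null handling as the question's methods
        }
        return input.replaceAll("%(?:_?%)+", "%");
    }

    public static void main(String[] args) {
        String[] samples = {
            "%_%", "%%", "%%_%%%_%%%_%", "Foo123%_%", "B4r$%_%", "B4rs%%_%", "%%Lorem_%%"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + collapse(s));
        }
    }
}
```

The $ in "B4r$%_%" is harmless here because it sits in the input, not in the replacement string.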
I'm not sure whether the inputs can take forms other than those listed. If they can't, an expression with start and end anchors may be more applicable here, either one pattern per input or something similar to:
^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final String regex = "^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$";
final String string = "%_%\n"
        + "%%\n"
        + "%%_%%\n"
        + "%%%_%%%\n"
        + "%_%%%\n"
        + "%%%_%\n"
        + "%%_%_%\n"
        + "%%_%%%_%%%_%";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}
Your newMethod1 actually works, except for one bug: you're returning the input parameter, not the result of your processing!
Change:
return input; // oops!
to:
return result;
Also, because you're not using regex, you should use replace() rather than replaceAll(), i.e.:
result = result.replace("%%","%").replace("%_%","%"); // replace() still replaces all occurrences
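Putting both corrections together, a loop-based version might look like the sketch below. The class and method names are hypothetical, and the special case that turns a lone "%" into an empty string is deliberately left out, on the assumption that the updated expected outputs in the question (which list % as the result) are the ones that count.

```java
// Hypothetical sketch of the corrected loop; names are made up for illustration.
class LoopCollapseDemo {

    static String collapseWithLoop(String input) {
        if (input == null) {
            return "";
        }
        String result = input;
        // Each pass shortens the string, so the loop always terminates.
        while (result.contains("%%") || result.contains("%_%")) {
            result = result.replace("%%", "%").replace("%_%", "%");
        }
        return result; // return the processed value, not the original input
    }

    public static void main(String[] args) {
        System.out.println(collapseWithLoop("A%%"));          // prints: A%
        System.out.println(collapseWithLoop("%%_%%%_%%%_%")); // prints: %
    }
}
```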
BTW, although not as strict, this works for all of your (currently) posted examples:
public static String myMethod(String input) {
return input.replaceAll("%[%_]*", "%");
}
It looks like all the patterns start with %, then have 0+ % or _ chars and end with %.
A single replaceAll is enough:
input = input.replaceAll("%[%_]*%", "%");
Details
% - a % char
[%_]* - 0 or more % or _ chars
% - a % char.
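The three expressions suggested in this thread agree on the posted examples but differ on edge cases. The sketch below compares them on two inputs that are not from the question ("%__%" and "Foo%_"), so treat those inputs, and the class and method names, as assumptions chosen only to expose the difference.

```java
// Hypothetical comparison harness; only the three regexes come from the answers.
class PatternComparisonDemo {

    // '%' followed by one or more (_?%) groups: never spans two consecutive underscores.
    static String strict(String s) {
        return s.replaceAll("%(?:_?%)+", "%");
    }

    // '%', any mix of '%' and '_', then a closing '%'.
    static String bounded(String s) {
        return s.replaceAll("%[%_]*%", "%");
    }

    // '%' followed by any mix of '%' and '_': also swallows trailing underscores.
    static String loose(String s) {
        return s.replaceAll("%[%_]*", "%");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"%%_%", "%__%", "Foo%_"}) {
            System.out.println(s + " -> strict: " + strict(s)
                    + ", bounded: " + bounded(s)
                    + ", loose: " + loose(s));
        }
    }
}
```

Which behavior is right depends on whether "%__%" and a trailing "_" count as illegal in your application; the question does not say.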

Replace regex pattern to lowercase in java

I'm trying to convert a URL string to lowercase, but I want to keep substrings that match a certain pattern as they are.
E.g., for input like:
http://BLABLABLA?qUERY=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
The expected output is the lowercased URL with the macros left unchanged:
http://blablabla?query=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
I tried to capture the strings using regex but couldn't figure out a proper way to do the replacement. Using replaceAll() didn't seem to do the job either. Any hints, please?
It looks like you want to change any uppercase character which is not inside ${...} to its lowercase form.
With a construct like
Matcher matcher = ...
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    String matchedPart = ...
    ...
    matcher.appendReplacement(buffer, replacement);
}
matcher.appendTail(buffer);
String result = buffer.toString();
or, since Java 9, with Matcher#replaceAll(Function<MatchResult,String> replacer), which lets us rewrite it as
String replaced = matcher.replaceAll(m -> {
    String matchedPart = m.group();
    ...
    return replacement;
});
you can build the replacement dynamically based on matchedPart.
So you can let your regex first try to match ${...} and only then (when ${...} cannot be matched at the current position) let it match [A-Z]. While iterating over the matches you can decide, based on the match itself (for example its length, or whether it starts with $), whether to use its lowercase form or its original form as the replacement.
BTW, the regex engine lets us place $x (where x is a group number) or ${name} (where name is a named group) in the replacement, so we can reuse those parts of the match. But if we want ${...} to appear literally in the replacement, we need to escape the $ as \$. To avoid doing that manually we can use Matcher.quoteReplacement.
Demo:
String yourUrlString = "http://BLABLABLA?qUERY=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}";
Pattern p = Pattern.compile("\\$\\{[^}]+\\}|[A-Z]");
Matcher m = p.matcher(yourUrlString);
StringBuffer sb = new StringBuffer();
while (m.find()) {
    String match = m.group();
    if (match.length() == 1) {
        m.appendReplacement(sb, match.toLowerCase());
    } else {
        m.appendReplacement(sb, Matcher.quoteReplacement(match));
    }
}
m.appendTail(sb);
String replaced = sb.toString();
System.out.println(replaced);
or in Java 9
String replaced = Pattern.compile("\\$\\{[^}]+\\}|[A-Z]")
        .matcher(yourUrlString)
        .replaceAll(m -> {
            String match = m.group();
            if (match.length() == 1)
                return match.toLowerCase();
            else
                return Matcher.quoteReplacement(match);
        });
System.out.println(replaced);
Output: http://blablabla?query=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
This regex will match all the characters before the first &macro, and put everything between http:// and the first &macro in its own group so you can modify it.
http://(.*?)&macro
UPDATE: If you don't want to use groups, this regex will match only the characters between http:// and the first &macro
(?<=http://)(.*?)(?=&macro)
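One possible way to use the lookaround version for the actual replacement is sketched below. It assumes Java 9 or later (for Matcher#replaceFirst with a function), and it only works when every macro sits after the first "&macro" in the URL, as in the sample input. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch built around the lookaround regex from this answer.
class PrefixLowercaseDemo {

    private static final Pattern PREFIX = Pattern.compile("(?<=http://)(.*?)(?=&macro)");

    // Lowercases only the part between "http://" and the first "&macro".
    static String lowercasePrefix(String url) {
        return PREFIX.matcher(url).replaceFirst(
                r -> Matcher.quoteReplacement(r.group().toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        String url = "http://BLABLABLA?qUERY=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}";
        System.out.println(lowercasePrefix(url));
        // prints: http://blablabla?query=sth&macro1=${MACRO_STR1}&macro2=${macro_str2}
    }
}
```

The quoteReplacement call is not strictly needed for this sample, but it keeps the method safe if the lowercased part ever contains $ or \.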

Finding a Match using java.lang.String.matches()

I have a String that contains new line characters say...
str = "Hello\n"+"Batman,\n" + "Joker\n" + "here\n"
I would like to know how to check for the existence of a particular word, say Joker, in the string str using java.lang.String.matches().
I find that str.matches(".*Joker.*") returns false, and returns true if I remove the newline characters. So what regex should be used as the argument to str.matches()?
One way is... str.replaceAll("\\n","").matches(".*Joker.*");
The problem is that the dot in .* does not match newlines by default. If you want newlines to be matched, your regex must have the flag Pattern.DOTALL.
If you want to embed that in a regex used in .matches() the regex would be:
"(?s).*Joker.*"
However, note that this will match Jokers too. A regex does not have the notion of words. Your regex would therefore really need to be:
"(?s).*\\bJoker\\b.*"
However, a regex does not need to match all its input text (which is what .matches() does, counterintuitively), only what is needed. Therefore, this solution is even better, and does not require Pattern.DOTALL:
Pattern p = Pattern.compile("\\bJoker\\b"); // \b is the word anchor
p.matcher(str).find(); // returns true
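The differences described above can be checked with a short sketch like the following. The class name and the containsWord helper are hypothetical; Pattern.quote is used there only so that an arbitrary word is treated literally.

```java
import java.util.regex.Pattern;

// Hypothetical demo comparing matches() with and without DOTALL, and find().
class MultilineMatchDemo {

    static final String STR = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";

    // Whole-word search anywhere in the text; no need to match the entire input.
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(STR.matches(".*Joker.*"));               // false: '.' stops at '\n'
        System.out.println(STR.matches("(?s).*Joker.*"));           // true: DOTALL enabled inline
        System.out.println("Jokers".matches("(?s).*Joker.*"));      // true: no word boundary
        System.out.println("Jokers".matches("(?s).*\\bJoker\\b.*")); // false
        System.out.println(containsWord(STR, "Joker"));             // true
    }
}
```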
You can do something much simpler; this is a contains. You do not need the power of regex:
public static void main(String[] args) throws Exception {
    final String str = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";
    System.out.println(str.contains("Joker"));
}
Alternatively you can use a Pattern and find:
public static void main(String[] args) throws Exception {
    final String str = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";
    final Pattern p = Pattern.compile("Joker");
    final Matcher m = p.matcher(str);
    if (m.find()) {
        System.out.println("Found match");
    }
}
You want to use a Pattern that uses the DOTALL flag, which says that a dot should also match new lines.
String str = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";
Pattern regex = Pattern.compile(".*Joker.*", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(str);
if (regexMatcher.find()) {
    // found a match
} else {
    // no match
}

Punctuation Regex in Java

First, I read the documentation here:
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
I want to find any punctuation character EXCEPT #',& but I don't quite understand how.
Here is my code:
public static void main(String[] args)
{
    // String to be scanned to find the pattern.
    String value = "#`~!#$%^";
    String pattern = "\\p{Punct}[^#',&]";
    // Create a Pattern object
    Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
    // Now create matcher object.
    Matcher m = r.matcher(value);
    if (m.find()) {
        System.out.println("Found value: " + m.groupCount());
    } else {
        System.out.println("NO MATCH");
    }
}
The result is NO MATCH.
Is there a mistake in my pattern?
Thanks
MRizq
You're matching two characters, not one. Using a (negative) lookahead should solve the task:
(?![#',&])\\p{Punct}
You may use character subtraction here:
String pat = "[\\p{Punct}&&[^#',&]]";
The whole pattern represents a character class, [...], that contains a \p{Punct} POSIX character class, the && intersection operator and [^...] negated character class.
A Unicode modifier might be necessary if you plan to also match all Unicode punctuation:
String pat = "(?U)[\\p{Punct}&&[^#',&]]";
^^^^
The pattern matches any punctuation (with \p{Punct}) except #, ', the comma and &.
If you need to exclude more characters, add them to the negated character class. Just remember to always escape -, \, ^, [ and ] inside a Java regex character class/set. E.g. adding a backslash and - might look like "[\\p{Punct}&&[^#',&\\\\-]]" or "[\\p{Punct}&&[^#',&\\-\\\\]]".
Java demo:
String value = "#`~!#$%^,";
String pattern = "(?U)[\\p{Punct}&&[^#',&]]";
Pattern r = Pattern.compile(pattern); // Create a Pattern object
Matcher m = r.matcher(value); // Now create matcher object.
while (m.find()) {
    System.out.println("Found value: " + m.group());
}
Output (# and , are excluded by the negated class; with (?U), \p{Punct} covers only Unicode punctuation, so the symbols `, ~, $ and ^ are not matched):
Found value: !
Found value: %
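To compare the variants discussed in this thread, here is a small sketch that collects all matches for a given regex. The class and helper names are hypothetical. Without (?U), \p{Punct} is the ASCII punctuation set, which also contains symbols such as `, ~, $ and ^.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for comparing the punctuation patterns from the answers.
class PunctuationDemo {

    // Concatenates every match of the regex found in the input.
    static String allMatches(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "#`~!#$%^,";
        // Character class intersection, ASCII punctuation only.
        System.out.println(allMatches("[\\p{Punct}&&[^#',&]]", value));     // prints: `~!$%^
        // Negative lookahead variant, same result.
        System.out.println(allMatches("(?![#',&])\\p{Punct}", value));      // prints: `~!$%^
        // Unicode-aware variant: symbols are no longer treated as punctuation.
        System.out.println(allMatches("(?U)[\\p{Punct}&&[^#',&]]", value)); // prints: !%
    }
}
```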

Exclude words with forward slash in Java regexp

I'm trying to allow only certain words through a regexp filter in Java, i.e.:
Pattern p = Pattern.compile("^[a-zA-Z0-9\\s\\.-_]{1," + s.length() + "}$");
But I find that it allows through 140km/h because forward slash isn't handled. Ideally, this word should not be allowed.
Can anyone suggest a fix to my current version?
I'm new to regexp and don't particularly follow it fully yet.
The regexp is in a utils class method as follows:
public static boolean checkStringAlphaNumericChars(String s) {
    s = s.trim();
    if ((s == null) || (s.equals(""))) {
        return false;
    }
    Pattern p = Pattern.compile("^[a-zA-Z0-9\\s\\.-_]{1," + s.length() + "}$");
    // Pattern p = Pattern.compile("^[a-zA-Z0-9_\\s]{1," + s.length() + "}");
    Matcher m = p.matcher(s);
    if (m.matches()) {
        return true;
    }
    else {
        return false;
    }
}
I want to allow strings containing underscores, spaces, periods and minus signs, so that strings such as 123.45 or -500.00 are accepted but 5,000.00 is not.
Is it because the hyphen is second-to-last in your character set and is therefore defining a range from the '.' to the '_', which includes '/'?
Try this:
Pattern p = Pattern.compile("^[a-zA-Z0-9\\s\\._-]+$");
Also, NullUserException is right in that there is no need for {1," + s.length() + "}; a + quantifier is enough. Starting your expression with '^' and ending it with '$' ensures that the entire string is consumed.
Finally, you can make use of \w as a substitute for [a-zA-Z_0-9], simplifying your expression to "^[\\w\\s\\.-]+$"
You can just use
public static boolean checkStringAlphaNumericChars(String s) {
    return (s != null) && s.matches("[\\w\\s.-]+");
}
The short-circuited null check ensures s is not null when you try to do .matches() on it.
Using \w to look for alphanumerics plus the underscore. tchrist will also be the first to point out this is more correct than [A-Za-z0-9_]
The + at the very end ensures you have at least one character (ie: the string is not empty)
There's no need to use ^ and $ since .matches() tries to match the pattern against the whole string.
There's also no need to escape the dot (.) in a character class.
New Demo: http://ideone.com/qraob
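The effect of the misplaced hyphen can be reproduced with a short sketch. The class and method names are hypothetical; buggy uses the character class from the question (with + instead of the length bound) and fixed uses the class from the answer above.

```java
// Hypothetical demo of the accidental '.'-to-'_' range in the original character class.
class CharClassRangeDemo {

    // ".-_" inside the class is a range from '.' to '_', which includes '/'.
    static boolean buggy(String s) {
        return s != null && s.matches("[a-zA-Z0-9\\s\\.-_]+");
    }

    // Hyphen placed last, so it is a literal '-' and no range is formed.
    static boolean fixed(String s) {
        return s != null && s.matches("[\\w\\s.-]+");
    }

    public static void main(String[] args) {
        System.out.println(buggy("140km/h"));  // true: '/' slips through the range
        System.out.println(fixed("140km/h"));  // false
        System.out.println(fixed("123.45"));   // true
        System.out.println(fixed("-500.00"));  // true
        System.out.println(fixed("5,000.00")); // false: ',' is not allowed
    }
}
```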
