How to replace excessive SQL wildcard by single regex pattern? - java

I am creating a function that strips the illegal wildcard patterns from the input string. The ideal solution should use a single regex expression, if at all possible.
The illegal wildcard patterns are: %% and %_%. Each instance of those should be replaced with %.
Here's the rub... I'm trying to perform some fuzz testing by running the function against various inputs to try to make it and break it.
It works for the most part; however, with complicated inputs, it doesn't.
The rest of this question has been updated:
The following inputs should return empty string (not an exhaustive list):
The following inputs should return % (not an exhaustive list).
%_%
%%
%%_%%
%_%%%
%%_%_%
%%_%%%_%%%_%
There will be cases where there are other characters with the input... like:
Foo123%_%
Should return "Foo123%"
B4r$%_%
Should return "B4r$%"
B4rs%%_%
Should return "B4rs%"
%%Lorem_%%
Should return "%Lorem_%"
I have tried using several different patterns and my tests are failing.
String input = "%_%%%%_%%%_%";
// old method:
public static String ancientMethod1(String input){
if (input == null)
return "";
return input.replaceAll("%_%", "").replaceAll("%%", ""); // Output: ""
}
// Attempt 1:
// Doesn't quite work right.
// "A%%" is returned as "A%%" instead of "A%"
public static String newMethod1(String input) {
String result = input;
while (result.contains("%%") || result.contains("%_%"))
result = result.replaceAll("%%","%").replaceAll("%_%","%");
if (result.equals("%"))
return "";
return input;
}
// Attempt 2:
// Succeeds, but I would like to simplify this:
public static String newMethod2(String input) {
if (input == null)
return "";
String illegalPattern1 = "%%";
String illegalPattern2 = "%_%";
String result = input;
while (result.contains(illegalPattern1) || result.contains(illegalPattern2)) {
result = result.replace(illegalPattern1, "%");
result = result.replace(illegalPattern2, "%");
}
if (result.equals("%") || result.equals("_"))
return "";
return result;
}
Here's a more complete defined example of how I'm using this: https://gist.github.com/sometowngeek/697c839a1bf1c9ee58be283b1396cf2e

This regular expression string matches all your examples:
"%(?:_?%)+"
It matches strings consisting of a '%' character followed by one or more sequences consisting of zero or one '_' character and one '%' character (close to literal translation), which is another way of saying what I did in comments: "a sequence of '%' and '_' characters, beginning and ending with '%', and not containing two consecutive '_' characters".
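As a minimal sketch, the pattern above could be applied with replaceAll as follows. The class and method names are illustrative and not taken from the question; the null handling mirrors the question's own methods.

```java
public class WildcardCleaner {
    // Collapses every run matched by %(?:_?%)+ into a single '%'.
    static String collapse(String input) {
        if (input == null) {
            return "";
        }
        return input.replaceAll("%(?:_?%)+", "%");
    }

    public static void main(String[] args) {
        String[] samples = {"%_%", "%%_%%%_%%%_%", "Foo123%_%", "B4r$%_%", "%%Lorem_%%"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + collapse(sample));
        }
    }
}
```

Because the replacement string is a plain "%", it needs no escaping even when the input itself contains a '$'.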

I'm not sure whether the inputs can take forms other than the ones listed. If not, an expression with start and end anchors may be more applicable here, either one pattern per input or something similar to:
^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "^%{1,3}(_%{1,3})?(_%{1,3})?(_%)?$";
final String string = "%_%\n"
+ "%%\n"
+ "%%_%%\n"
+ "%%%_%%%\n"
+ "%_%%%\n"
+ "%%%_%\n"
+ "%%_%_%\n"
+ "%%_%%%_%%%_%";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}

Your newMethod1 actually works, except you have a typo - you're returning the input parameter, not the result of your processing!
Change:
return input; // oops!
to:
return result;
Also, because you're not using regex, you should use replace() rather than replaceAll(), i.e.:
result = result.replace("%%","%").replace("%_%","%"); // still replaces all occurrences
replace() still replaces all occurrences.
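To illustrate the difference, here is a small sketch (the class name and sample string are made up for the example). Both methods replace every occurrence; they differ only in whether the first argument is read as literal text or as a regex.

```java
public class ReplaceVsReplaceAll {
    public static void main(String[] args) {
        String s = "a.b.c";
        // replace() treats its first argument as literal text and replaces every occurrence.
        System.out.println(s.replace(".", "-"));    // a-b-c
        // replaceAll() treats its first argument as a regex, where '.' matches any character.
        System.out.println(s.replaceAll(".", "-")); // -----
    }
}
```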
BTW, although not as strict, this works for all of your (currently) posted examples:
public static String myMethod(String input) {
return input.replaceAll("%[%_]*", "%");
}

It looks like all the patterns start with %, then have 0+ % or _ chars and end with %.
Use a mere
input = input.replaceAll("%[%_]*%", "%");
Details
% - a % char
[%_]* - 0 or more % or _ chars
% - a % char.
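The sketch below compares this pattern with the stricter "%(?:_?%)+" from the other answer, to show where they diverge. The class and method names are illustrative only.

```java
public class PatternStrictness {
    // Stricter pattern: at most one '_' between two '%' characters.
    static String strict(String s) {
        return s.replaceAll("%(?:_?%)+", "%");
    }

    // Looser pattern: any mix of '%' and '_' between two '%' characters.
    static String loose(String s) {
        return s.replaceAll("%[%_]*%", "%");
    }

    public static void main(String[] args) {
        // Both agree on an input from the question.
        System.out.println(strict("%%Lorem_%%") + " " + loose("%%Lorem_%%"));
        // They differ once two underscores sit between the percent signs.
        System.out.println(strict("%__%")); // %__%
        System.out.println(loose("%__%"));  // %
    }
}
```

Which behavior is right depends on whether "%__%" should count as an illegal pattern, which the question does not state.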

Related

Regex when pattern involves dollar sign ($)

I'm running into a bit of an issue when it comes to matching sub-patterns that involve the dollar sign. For example, consider the following chunk of text:
(en $) foo
oof ($).
ofo (env. 80 $US)
I'm using the following regex :
Pattern p = Pattern.compile(
"\\([\\p{InARABIC}\\s]+\\)|\\([\\p{InBasic_Latin}\\s?\\$]+\\)|\\)([\\p{InARABIC}\\s]+)\\(",
Pattern.CASE_INSENSITIVE);
public String replace(String text) {
Matcher m = p.matcher(text);
String replacement = m.replaceAll(match -> {
if (m.group(1) == null) {
return m.group();
} else {
return "(" + match.group(1) + ")";
}
});
return replacement;
}
but can't match text containing $
This code is similar to replaceAll(regex, replacement). The problem is that $ is special not only in the regex argument but also in the replacement, where it refers to a matched group, as in $x (where x is the group number) or ${groupName} if your regex contains (?<groupName>subregex).
This allows us to write code like
String doubled = "abc".replaceAll(".", "$0$0");
System.out.println(doubled); //prints: aabbcc
which will replace each character with its two copies since each character will be matched by . and placed in group 0, so $0$0 represents two repetitions of that matched character.
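A short sketch of numbered and named group references in a replacement string (the sample inputs and class name are invented for illustration):

```java
public class GroupReferences {
    public static void main(String[] args) {
        // $1 and $2 refer to the first and second capturing groups.
        System.out.println("john smith".replaceAll("(\\w+) (\\w+)", "$2, $1")); // smith, john
        // ${name} refers to a named group declared with (?<name>...).
        System.out.println("2024-01".replaceAll("(?<year>\\d{4})-(?<month>\\d{2})", "${month}/${year}")); // 01/2024
    }
}
```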
But in your case the text itself contains $, so when it is matched you replace it with itself. The replacement then contains a $ with no group number or name after it, which results in IllegalArgumentException: Illegal group reference.
The solution is to escape that $ in the replacement. You can do it manually with \, but it is better to use the method designed for that purpose, Matcher.quoteReplacement (if the replacement syntax ever gains more special characters, this method should evolve along with the regex engine, which should save you some trouble later).
So try changing your code to
public String replace(String text) {
    Matcher m = p.matcher(text);
    // Quote every replacement so that a '$' in the matched text is taken literally.
    String replacement = m.replaceAll(match -> {
        if (match.group(1) == null) {
            return Matcher.quoteReplacement(match.group());
        } else {
            return Matcher.quoteReplacement("(" + match.group(1) + ")");
        }
    });
    return replacement;
}
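The effect of Matcher.quoteReplacement can be seen in isolation with a small sketch. The fill helper and the "{}" placeholder are hypothetical and exist only for this demonstration.

```java
import java.util.regex.Matcher;

public class QuoteReplacementDemo {
    // Replaces the literal token "{}" with the given text, taken verbatim.
    static String fill(String template, String text) {
        return template.replaceAll("\\{\\}", Matcher.quoteReplacement(text));
    }

    public static void main(String[] args) {
        System.out.println(fill("price: {}", "(en $)")); // price: (en $)
        try {
            // Without quoting, the lone '$' is parsed as the start of a group reference.
            "price: {}".replaceAll("\\{\\}", "(en $)");
        } catch (IllegalArgumentException e) {
            System.out.println("unquoted replacement failed: " + e.getMessage());
        }
    }
}
```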

How to Determine if a String starts with exact number of zeros?

How can I know if my string exactly starts with {n} number of leading zeros?
For example below, the conditions would return true but my real intention is to check if the string actually starts with only 2 zeros.
String str = "00063350449370";
if (str.startsWith("00")) { // true
...
}
You can do something like:
if ( str.startsWith("00") && ! str.startsWith("000") ) {
// ..
}
This will make sure that the string starts with "00", but not a longer string of zeros.
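The same check can be generalised to any number of zeros by counting them. This is only a sketch; the class and method names are not from the question.

```java
public class LeadingZeros {
    // Returns true when s starts with exactly n '0' characters.
    static boolean startsWithExactly(String s, int n) {
        int count = 0;
        while (count < s.length() && s.charAt(count) == '0') {
            count++;
        }
        return count == n;
    }

    public static void main(String[] args) {
        String str = "00063350449370";
        System.out.println(startsWithExactly(str, 2)); // false, it has three leading zeros
        System.out.println(startsWithExactly(str, 3)); // true
    }
}
```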
You can try this regex
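boolean res = s.matches("00([^0].*)?"); // exactly two zeros, then either the end of the string or a non-zero character followed by anything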
How about?
final String zeroes = "00";
final int zeroesLength = zeroes.length();
boolean exactMatch = str.startsWith(zeroes) && (str.length() == zeroesLength || str.charAt(zeroesLength) != '0');
Slow but:
if (str.matches("(?s)0{3}([^0].*)?")) {
This uses the (?s) DOTALL option to let . also match line breaks.
0{3} matches exactly three zeros; use 0{2} for two.
How about using a regular expression?
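0{n}([^0].*)?
where n is the number of leading '0's you want. The optional group ([^0].*)? requires that anything after the zeros starts with a non-zero character, while still allowing zeros later in the string. You can utilise the Java regex API to check if the input matches the expression:
Pattern pattern = Pattern.compile("0{2}([^0].*)?"); // n = 2 here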
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
// code
}
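A sketch that builds the pattern for an arbitrary n (the helper name is hypothetical). It uses ([^0].*)? after the zeros so that a zero appearing later in the string does not cause a mismatch, which a plain [^0]* tail would.

```java
import java.util.regex.Pattern;

public class LeadingZerosRegex {
    // True when input starts with exactly n zeros.
    static boolean startsWithExactly(String input, int n) {
        // 0{n}, then either the end of input or a non-zero character followed by anything.
        Pattern pattern = Pattern.compile("0{" + n + "}([^0].*)?", Pattern.DOTALL);
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsWithExactly("00063350449370", 2)); // false
        System.out.println(startsWithExactly("00063350449370", 3)); // true
    }
}
```

If the check runs often for the same n, compiling the Pattern once and reusing it avoids repeated compilation.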
You can use a regular expression to evaluate the String value:
String str = "00063350449370";
String pattern = "[0]{2}[1-9]{1}[0-9]*"; // [0]{2}[1-9]{1} starts with 2 zeros, followed by a non-zero value, and maybe some other numbers: [0-9]*
if (Pattern.matches(pattern, str))
{
// DO SOMETHING
}
There might be a better regular expression to resolve this, but this should give you a general idea how to proceed if you choose the regular expression path.
The long way
String TestString = "0000123";
Pattern p = Pattern.compile("\\A0+(?=\\d)");
Matcher matcher = p.matcher(TestString);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(" Group: " + matcher.group());
}
You're probably better off with a small for loop though
int leadZeroes;
for (leadZeroes=0; leadZeroes<TestString.length(); leadZeroes++)
if (TestString.charAt(leadZeroes) != '0')
break;
System.out.println("Count of Leading Zeroes: " + leadZeroes);

How to return the first chunk of either numerics or letters from a string?

For example, if I had (-> means return):
aBc123afa5 -> aBc
168dgFF9g -> 168
1GGGGG -> 1
How can I do this in Java? I assume it's something regex related but I'm not great with regex and so not too sure how to implement it (I could with some thought but I have a feeling it would be 5-10 lines long, and I think this could be done in a one-liner).
Thanks
String myString = "aBc123afa5";
String extracted = myString.replaceAll("^([A-Za-z]+|\\d+).*$", "$1");
To use Matcher.group() and reuse a Pattern for efficiency:
// Class
private static final Pattern pattern = Pattern.compile("^([A-Za-z]+|\\d+).*$");
// Your method
{
String myString = "aBc123afa5";
Matcher matcher = pattern.matcher(myString);
if(matcher.matches())
System.out.println(matcher.group(1));
}
Note: /^([A-Za-z]+|\d+).*$/ and /^([A-Za-z]+|\d+)/ both perform with similar efficiency. On regex101 you can compare the matcher debug logs to confirm this.
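For the shorter pattern without the trailing .*$, Matcher.find() can return the chunk directly. A minimal sketch, with illustrative class, constant, and method names, that returns an empty string when the input starts with neither a letter nor a digit:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstChunk {
    private static final Pattern LEADING_CHUNK = Pattern.compile("^(?:[A-Za-z]+|\\d+)");

    // Returns the leading run of letters or digits, or "" if there is none.
    static String firstChunk(String s) {
        Matcher matcher = LEADING_CHUNK.matcher(s);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstChunk("aBc123afa5")); // aBc
        System.out.println(firstChunk("168dgFF9g"));  // 168
        System.out.println(firstChunk("1GGGGG"));     // 1
    }
}
```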
Without using regex, you can do this:
String string = "168dgFF9g";
String chunk = "" + string.charAt(0);
boolean searchDigit = Character.isDigit(string.charAt(0));
for (int i = 1; i < string.length(); i++) {
boolean isDigit = Character.isDigit(string.charAt(i));
if (isDigit == searchDigit) {
chunk += string.charAt(i);
} else {
break;
}
}
System.out.println(chunk);
public static String prefix(String s) {
return s.replaceFirst("^(\\d+|\\pL+|).*$", "$1");
}
where
\\d = digit
\\pL = letter
postfix + = one or more
| = or
^ = begin of string
$ = end of string
$1 = first group `( ... )`
An empty alternative (the last |) ensures that (...) always matches, so a replacement always happens. Otherwise the original string would be returned.
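The effect of the empty alternative shows up on inputs that start with neither a digit nor a letter. The sketch below contrasts the two variants; the class and method names are invented for the comparison.

```java
public class EmptyAlternative {
    // Pattern from the answer: the group may match the empty string.
    static String withEmpty(String s) {
        return s.replaceFirst("^(\\d+|\\pL+|).*$", "$1");
    }

    // Same pattern without the empty alternative.
    static String withoutEmpty(String s) {
        return s.replaceFirst("^(\\d+|\\pL+).*$", "$1");
    }

    public static void main(String[] args) {
        System.out.println(withEmpty("168dgFF9g"));      // 168
        System.out.println("[" + withEmpty("---") + "]");    // []
        System.out.println("[" + withoutEmpty("---") + "]"); // [---]
    }
}
```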

Recursive replace with Java regular expression?

I can replace ABC(10,5) with (10)%(5) using:
replaceAll("ABC\\(([^,]*)\\,([^,]*)\\)", "($1)%($2)")
but I'm unable to figure out how to do it for ABC(ABC(20,2),5) or ABC(ABC(30,2),3+2).
If I'm able to convert to ((20)%(2))%5 how can I convert back to ABC(ABC(20,2),5)?
Thanks,
j
I am going to answer the first question. I was not able to do the task in a single replaceAll, and I don't think it is even achievable. However, with a loop, this should do the work for you:
String termString = "([0-9+\\-*/()%]*)";
String pattern = "ABC\\(" + termString + "\\," + termString + "\\)";
String [] strings = {"ABC(10,5)", "ABC(ABC(20,2),5)", "ABC(ABC(30,2),3+2)"};
for (String str : strings) {
while (true) {
String replaced = str.replaceAll(pattern, "($1)%($2)");
if (replaced.equals(str)) {
break;
}
str = replaced;
}
System.out.println(str);
}
I am assuming you are writing a parser for numeric expressions, hence the definition of the term: termString = "([0-9+\\-*/()%]*)". It outputs this:
(10)%(5)
((20)%(2))%(5)
((30)%(2))%(3+2)
EDIT: As per the OP's request, here is the code for decoding the strings. It is a bit more hacky than the forward scenario:
String [] encoded = {"(10)%(5)", "((20)%(2))%(5)", "((30)%(2))%(3+2)"};
String decodeTerm = "([0-9+\\-*ABC\\[\\],]*)";
String decodePattern = "\\(" + decodeTerm + "\\)%\\(" + decodeTerm + "\\)";
for (String str : encoded) {
while (true) {
String replaced = str.replaceAll(decodePattern, "ABC[$1,$2]");
if (replaced.equals(str)) {
break;
}
str = replaced;
}
str = str.replaceAll("\\[", "(");
str = str.replaceAll("\\]", ")");
System.out.println(str);
}
And the output is:
ABC(10,5)
ABC(ABC(20,2),5)
ABC(ABC(30,2),3+2)
You can start by evaluating the innermost reducible expressions first, until no more reducible expressions exist. However, you have to take care of the other ',', '(' and ')' characters. The solution of @BorisStrandjev is better and more bulletproof.
String infix(String expr) {
    // Use placeholders "<<" and ">>" for parentheses that are already handled,
    // so that the regex [^,()] only ever sees innermost, unprocessed terms.
    for (;;) {
        // Plain (non-ABC) parenthesised groups that contain no further parentheses.
        String expr2 = expr.replaceAll("(?<!ABC)\\(([^()]*)\\)", "<<$1>>");
        // Innermost ABC(x,y) calls.
        expr2 = expr2.replaceAll("ABC\\(([^,()]*),([^,()]*)\\)", "<<$1>>%<<$2>>");
        if (expr2.equals(expr))
            break;
        expr = expr2;
    }
    // Turn the placeholders back into real parentheses.
    expr = expr.replace("<<", "(");
    expr = expr.replace(">>", ")");
    return expr;
}
You could use this Regular Expressions library https://github.com/florianingerl/com.florianingerl.util.regex , that also supports Recursive Regular Expressions.
Converting ABC(ABC(20,2),5) to ((20)%(2))%(5) looks like this:
Pattern pattern = Pattern.compile("(?<abc>ABC\\((?<arg1>(?:(?'abc')|[^,])+)\\,(?<arg2>(?:(?'abc')|[^)])+)\\))");
Matcher matcher = pattern.matcher("ABC(ABC(20,2),5)");
String replacement = matcher.replaceAll(new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("abc".equals(node.getGroupName())) {
return "(" + replace(node.getChildren().get(0)) + ")%(" + replace(node.getChildren().get(1)) + ")";
} else
return super.replace(node);
}
});
System.out.println(replacement);
assertEquals("((20)%(2))%(5)", replacement);
Converting back again, i.e. from ((20)%(2))%(5) to ABC(ABC(20,2),5) looks like this:
Pattern pattern = Pattern.compile("(?<fraction>(?<arg>\\(((?:(?'fraction')|[^)])+)\\))%(?'arg'))");
Matcher matcher = pattern.matcher("((20)%(2))%(5)");
String replacement = matcher.replaceAll(new DefaultCaptureReplacer() {
@Override
public String replace(CaptureTreeNode node) {
if ("fraction".equals(node.getGroupName())) {
return "ABC(" + replace(node.getChildren().get(0)) + "," + replace(node.getChildren().get(1)) + ")";
} else if ("arg".equals(node.getGroupName())) {
return replace(node.getChildren().get(0));
} else
return super.replace(node);
}
});
System.out.println(replacement);
assertEquals("ABC(ABC(20,2),5)", replacement);
You can try to rewrite the string using the Polish notation and then replace any % X Y with ABC(X,Y).
The problem is that you need to find out which rewrite of ABC(X,Y) occurred first when you recursively replaced them in your string. The Polish notation is useful for "deciphering" the order that these rewrites occur and is widely used in expression evaluation.
You can do this by using a stack and recording which replace occurred first: find the inner-most set of parentheses, push only that expression onto the stack, then remove that from your string. When you want to reconstruct the original expression, just start at the top of the stack and apply the reverse transformation (X)%(Y) -> ABC(X,Y).
This is somewhat a form of the Polish notation, with the only difference being that you don't store the entire expression as a string, but rather store it in a stack for easier processing.
In short, when replacing, start with the inner-most terms (the ones that have no parentheses in them) and apply the reverse replace.
It may be helpful to use (X)%(Y) -> ABC{X,Y} as an intermediary rewrite rule, then rewrite the curly brackets as round brackets. This way it will be easier to determine which is the inner-most term, as the new terms won't use round brackets. Also it is easier to implement, but not as elegant.
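The "innermost terms first" idea can also be sketched without regular expressions. The example below handles the forward direction, ABC(X,Y) -> (X)%(Y), and uses plain recursion with parenthesis-depth counting instead of the explicit stack described above. All class and method names are hypothetical, and it assumes well-formed input in which every ABC( has a matching ')' and a top-level comma.

```java
public class AbcToInfix {
    // Rewrites every ABC(x,y) in expr as (x)%(y); arguments are converted recursively,
    // so the innermost calls are rewritten first.
    static String convert(String expr) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < expr.length()) {
            if (expr.startsWith("ABC(", i)) {
                int close = matchingParen(expr, i + 3);
                int comma = topLevelComma(expr, i + 4, close);
                String left = convert(expr.substring(i + 4, comma));
                String right = convert(expr.substring(comma + 1, close));
                out.append('(').append(left).append(")%(").append(right).append(')');
                i = close + 1;
            } else {
                out.append(expr.charAt(i++));
            }
        }
        return out.toString();
    }

    // Index of the ')' that closes the '(' located at index open.
    static int matchingParen(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("Unbalanced parentheses in: " + s);
    }

    // Index of the first comma between from and to that is not nested in parentheses.
    static int topLevelComma(String s, int from, int to) {
        int depth = 0;
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("No top-level comma in: " + s);
    }

    public static void main(String[] args) {
        System.out.println(convert("ABC(ABC(20,2),5)")); // ((20)%(2))%(5)
    }
}
```

Counting depth explicitly keeps nested arguments intact, which is the part a single non-recursive regex cannot express.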

How to determine where a regex failed to match using Java APIs

I have tests where I validate the output with a regex. When it fails it reports that output X did not match regex Y.
I would like to add some indication of where in the string the match failed. E.g. what is the farthest the matcher got in the string before backtracking. Matcher.hitEnd() is one case of what I'm looking for, but I want something more general.
Is this possible to do?
If a match fails, then Matcher.hitEnd() tells you whether a longer string could have matched. In addition, you can specify a region in the input sequence that will be searched to find a match. So if you have a string that cannot be matched, you can test its prefixes to see where the match fails:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LastMatch {
private static int indexOfLastMatch(Pattern pattern, String input) {
Matcher matcher = pattern.matcher(input);
for (int i = input.length(); i > 0; --i) {
Matcher region = matcher.region(0, i);
if (region.matches() || region.hitEnd()) {
return i;
}
}
return 0;
}
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
String[] samples = {
"*ABC",
"A1b*",
"AB12uv",
"AB12uv*",
"ABCDabc",
"ABC123X"
};
for (String sample : samples) {
int lastMatch = indexOfLastMatch(pattern, sample);
System.out.println(sample + ": last match at " + lastMatch);
}
}
}
The output of this class is:
*ABC: last match at 0
A1b*: last match at 3
AB12uv: last match at 6
AB12uv*: last match at 6
ABCDabc: last match at 4
ABC123X: last match at 6
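Building on the same prefix test, a test failure message could point at the position where matching stopped. This is only a sketch: the class, the method names, and the message format are assumptions, not part of any testing library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchFailureReport {
    // Length of the longest prefix that either matches or could still
    // match if more input followed (Matcher.hitEnd() is true).
    static int viablePrefixLength(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        for (int i = input.length(); i > 0; i--) {
            matcher.region(0, i);
            if (matcher.matches() || matcher.hitEnd()) {
                return i;
            }
        }
        return 0;
    }

    // Builds a two-line message with a caret under the index where matching stopped.
    static String describe(Pattern pattern, String input) {
        if (pattern.matcher(input).matches()) {
            return input + ": matches";
        }
        int pos = viablePrefixLength(pattern, input);
        StringBuilder message = new StringBuilder(input).append('\n');
        for (int i = 0; i < pos; i++) {
            message.append(' ');
        }
        return message.append("^ matching stopped at index ").append(pos).toString();
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
        System.out.println(describe(pattern, "ABCDabc"));
    }
}
```

When the returned position equals the input length, the input was a viable prefix that simply ended too early.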
You can take the string, and iterate over it, removing one more char from its end at every iteration, and then check for hitEnd():
int farthestPoint(Pattern pattern, String input) {
for (int i = input.length() - 1; i > 0; i--) {
Matcher matcher = pattern.matcher(input.substring(0, i));
if (!matcher.matches() && matcher.hitEnd()) {
return i;
}
}
return 0;
}
You could use a pair of replaceAll() calls to indicate the positive and negative matches of the input string. Let's say, for example, you want to validate a hex string; the following will indicate the valid and invalid characters of the input string.
String regex = "[0-9A-F]";
String input = "J900ZZAAFZ99X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
// Mark every valid character with '+', then everything else with '-'.
String mask = m.replaceAll("+").replaceAll("[^+]", "-");
System.out.println(input);
System.out.println(mask);
This would print the following, with a + under valid characters and a - under invalid characters.
J900ZZAAFZ99X
-+++--+++-++-
If you want to do it outside of the code, I use Rubular to test regular expressions before putting them in the code.
