Regex replacing everything before a predefined range of chars - Java

I have string values from which I want to remove or replace everything that comes before "TH" or "TV". My problem is that, although the syntax looks correct to me, the string stays the same.
String test = "10TH";
String replaceBeforeSide = test.replaceAll("^\\(TH|TV)+", "");
System.out.println(replaceBeforeSide);
//Desired result = "TH";

Converting my comment to answer so that solution is easy to find for future visitors.
You could use a simple regex with a capture group:
replaceBeforeSide = test.replaceAll(".+?(TH|TV)", "$1");
or even shorter:
replaceBeforeSide = test.replaceAll(".+?(T[HV])", "$1");
With .+? we lazily match one or more of any character before (TH|TV), which we capture in group 1.
In the replacement we put $1 back, so only the text before TH or TV is removed.
We could also use a lookahead and avoid capture group:
replaceBeforeSide = test.replaceAll(".+?(?=T[HV])", "");
If you want case-insensitive matching, use the inline modifier (?i):
replaceBeforeSide = test.replaceAll("(?i).+?(?=T[HV])", "");
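The three variants above can be compared side by side. This is a minimal sketch; the class and method names (PrefixStripper, stripWithGroup and so on) are made up for illustration and are not part of the original answer.

```java
final class PrefixStripper {
    // Capture-group variant: lazily consume the prefix, then put TH/TV back via $1.
    static String stripWithGroup(String s) {
        return s.replaceAll(".+?(T[HV])", "$1");
    }

    // Lookahead variant: the marker is only asserted, so nothing needs to be put back.
    static String stripWithLookahead(String s) {
        return s.replaceAll(".+?(?=T[HV])", "");
    }

    // Same as the lookahead variant, but (?i) makes the marker case-insensitive.
    static String stripIgnoreCase(String s) {
        return s.replaceAll("(?i).+?(?=T[HV])", "");
    }

    public static void main(String[] args) {
        System.out.println(stripWithGroup("10TH"));      // TH
        System.out.println(stripWithLookahead("abcTV")); // TV
        System.out.println(stripIgnoreCase("10tv"));     // tv
    }
}
```

Because .+? needs at least one character, a string that already starts with TH or TV is left unchanged.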

Related

Regex for matching strings between '=' and '/' or "=" and end of string

I am looking for regex which can help me replace strings like
source=abc/task=cde/env=it --> source='abc'/task='cde'/env='it'
To be more precise, I want to take each value that starts after = and ends at either / or the end of the string, and wrap it in single quotes.
I tried code like this:
"source=abc/task=cde/env=it".replaceAll("=(.*?)/","'$1'")
But that results in
source'abc'task'cde'env=it
Using lookahead and lookbehind:
(?<==)([^/]*)((?=/)|$)
Lookbehind allows you to specify what comes before your match. In this case an equals: (?<==).
The main match in my regex looks for any non-slash character, zero or more times: ([^/]*)
Lookahead allows you to specify what comes after your match. In this case, a slash: (?=/).
The $ matches the end of the line, so that the last item in your test data becomes quoted. ((?=/)|$) combines this with the lookahead, meaning "either a slash comes after the match or this is the end of the line".
Here it is in action in a test.
@Test
public void test_quote_items() {
String regex = "(?<==)([^/]*)((?=/)|$)";
String actual = "source=abc/task=cde/env=it".replaceAll(regex,"'$1'");
String expected = "source='abc'/task='cde'/env='it'";
assertEquals(expected, actual);
}
Try
String input = "source=abc/task=cde/env=it".replaceAll("=(.*?)(/|$)","='$1'/");
The problems I found are that you are not putting the = back in the replacement, and that your pattern requires a /, which is missing at the end of the string, so the last item is never matched.
Output:
source='abc'/task='cde'/env='it'/
If you don't want the trailing '/', it is trivial to remove.
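One way to avoid the trailing slash altogether is to put the second group back instead of a literal /. This is a small variation on the answer above, not something the original posters wrote; the class name QuoteValues and its method names are hypothetical.

```java
final class QuoteValues {
    // Group 2 is either "/" or the empty match at end of input,
    // so "$2" restores the slash only where one was actually present.
    static String quoteWithGroups(String s) {
        return s.replaceAll("=(.*?)(/|$)", "='$1'$2");
    }

    // Lookbehind/lookahead variant from the first answer: only the value itself is matched.
    static String quoteWithLookaround(String s) {
        return s.replaceAll("(?<==)([^/]*)((?=/)|$)", "'$1'");
    }

    public static void main(String[] args) {
        String input = "source=abc/task=cde/env=it";
        System.out.println(quoteWithGroups(input));     // source='abc'/task='cde'/env='it'
        System.out.println(quoteWithLookaround(input)); // source='abc'/task='cde'/env='it'
    }
}
```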

Regex pattern error on API 21(android 5) and below

On Android 5 and below, my regex pattern throws an error at runtime:
java.util.regex.PatternSyntaxException: Syntax error in regexp pattern near index 4:
(?<g1>(http|ftp)(s)?://)?(?<g2>[\w-:#])+(?<TLD>\.[\w\-]+)+(:\d+)?((|\?)([\w\-._~:/?#\[\]#!$&'()*+,;=.%])*)*
Here is a code sample:
val urlRegex = "(?<g1>(http|ftp)(s)?://)?(?<g2>[\\w-:#])+(?<TLD>\\.[\\w\\-]+)+(:\\d+)?((|\\?)([\\w\\-._~:/?#\\[\\]#!$&'()*+,;=.%])*)*"
val sampleUrl = "https://www.google.com"
val urlMatchers = Pattern.compile(urlRegex).matcher(sampleUrl)
assert(urlMatchers.find())
This pattern works fine on all APIs above 21.
It seems the earlier versions do not support named groups. As per this source, the named groups were introduced in Kotlin 1.2. Remove them if you do not need those submatches and only use the regex for validation.
Your regex is very inefficient as it contains a lot of nested quantified groups. See a "cleaner" version of it below.
Also, it seems you want to check if there is a regex match inside your input string. Use Regex#containsMatchIn():
val urlRegex = "(?:(?:http|ftp)s?://)?[\\w:#.-]+\\.[\\w-]+(?::\\d+)?\\??[\\w.~:/?#\\[\\]#!$&'()*+,;=.%-]*"
val sampleUrl = "https://www.google.com"
val urlMatchers = Regex(urlRegex).containsMatchIn(sampleUrl)
println(urlMatchers) // => true
See the Kotlin demo and the regex demo.
If you need to check the whole string match use matches:
Regex(urlRegex).matches(sampleUrl)
See another Kotlin demo.
Note that to define a regex, you need to use the Regex class constructor.
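If the submatches are still needed on older API levels, one option is to fall back to plain numbered groups, which do not depend on named-group support. The sketch below is written in Java (it uses the same java.util.regex classes as the question) and deliberately uses a simplified stand-in pattern, not the full URL regex from the question; the class name and the choice of groups are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberedGroupsUrl {
    // Group 1: optional scheme, group 2: host without the last label, group 3: last label.
    // Simplified illustration only; this is not a complete URL validator.
    private static final Pattern URL =
        Pattern.compile("((?:http|ftp)s?://)?([\\w:.-]+)(\\.[\\w-]+)(?::\\d+)?");

    // Returns {scheme, host, tld} for the first match, or null if nothing matches.
    static String[] parts(String input) {
        Matcher m = URL.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("https://www.google.com");
        System.out.println(p[0]); // https://
        System.out.println(p[1]); // www.google
        System.out.println(p[2]); // .com
    }
}
```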

Any suggestion why my regex does not work?

I got the following string to extract some information from:
String: String: String Number;
Right now I'm using the following regex to get the arguments:
(.*?):(.*?):(.*?);$
This way I would get with a Matcher the following output:
group(1) = String
group(2) = String
group(3) = String Number
If I want the number I need to execute another regex on the output of the 3rd group like the following:
([a-zA-Z]* ?([0-9])?$)
Used on the string "String Number", this would give me output like
group(1) = String
group(2) = Number
I thought about combining both steps and using a regex like (.*?):(.*?):([a-zA-Z]* ?([0-9])?);$ on the string "String: String: String Number;". But this does not work and I don't see the reason.
Here you go. I added some extra whitespace matching, and this seems to work. You were missing the whitespace between the second : and the following string:
^(.*?):\s*(.*?):\s*([a-zA-Z]*\s+([0-9])?);$
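The difference between the two combined patterns can be checked with a short program. The input "foo: bar: baz 7;" is a made-up sample standing in for "String: String: String Number;" with a single digit as the number, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CombinedRegexDemo {
    // The asker's combined attempt: nothing consumes the space after the second colon.
    static final Pattern ORIGINAL =
        Pattern.compile("(.*?):(.*?):([a-zA-Z]* ?([0-9])?);$");

    // The answer's version: \s* after each colon absorbs that whitespace.
    static final Pattern FIXED =
        Pattern.compile("^(.*?):\\s*(.*?):\\s*([a-zA-Z]*\\s+([0-9])?);$");

    // Returns groups 1..4 of the fixed pattern, or null when the input does not match.
    static String[] parse(String input) {
        Matcher m = FIXED.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        String sample = "foo: bar: baz 7;"; // hypothetical input
        System.out.println(ORIGINAL.matcher(sample).find()); // false
        System.out.println(String.join("|", parse(sample))); // foo|bar|baz 7|7
    }
}
```

Note that ([0-9])? accepts only a single digit, so a multi-digit number would need [0-9]+ instead.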

Bug in java.util.regex in sun jdk 6.0.24?

The following code blocks on my system. Why?
System.out.println( Pattern.compile(
"^((?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*)/\\*.*?\\*/(.*)$",
Pattern.MULTILINE | Pattern.DOTALL ).matcher(
"\n\n\n\n\n\nUPDATE \"$SCHEMA\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';"
).matches() );
The pattern (designed to detect comments of the form /*...*/ but not within ' or ") should be fast, as it is deterministic...
Why does it take soooo long?
You're running into catastrophic backtracking.
Looking at your regex, it's easy to see how .*? and (.*) can match the same content since both also can match the intervening \*/ part (dot matches all, remember). Plus (and even more problematic), they can also match the same stuff that ((?:[^'"][^'"]*|"[^"]*"|'[^']*')*) matches.
The regex engine gets bogged down in trying all the permutations, especially if the string you're testing against is long.
I've just checked your regex against your string in RegexBuddy. It aborts the match attempt after 1,000,000 steps of the regex engine. Java will keep churning until it gets through all permutations or until a stack overflow occurs...
You can greatly improve the performance of your regex by prohibiting backtracking into stuff that has already been matched. You can use atomic groups for this, changing your regex into
^((?>[^'"]+|"[^"]*"|'[^']*')*)(?>/\*.*?\*/)(.*)$
or, as a Java string:
"^((?>[^'\"]+|\"[^\"]*\"|'[^']*')*)(?>/\\*.*?\\*/)(.*)$"
This reduces the number of steps the regex engine has to go through from > 1 million to 58.
Be advised though that this will only find the first occurrence of a comment, so you'll have to apply the regex repeatedly until it fails.
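A minimal sketch of the atomic-group version, run against the input from the question. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class AtomicGroupDemo {
    // (?>...) is an atomic group: once it has matched, the engine
    // will not backtrack into it to try shorter alternatives.
    private static final Pattern COMMENT = Pattern.compile(
        "^((?>[^'\"]+|\"[^\"]*\"|'[^']*')*)(?>/\\*.*?\\*/)(.*)$",
        Pattern.MULTILINE | Pattern.DOTALL);

    static boolean matchesComment(String sql) {
        return COMMENT.matcher(sql).matches();
    }

    public static void main(String[] args) {
        String noComment = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" SET \"VERSION\" = 12"
            + " WHERE NAME = 'SOMENAMEVALUE';";
        System.out.println(matchesComment(noComment));        // false, returned quickly
        System.out.println(matchesComment("/*c*/ SELECT 1")); // true
    }
}
```

One caveat with this sketch: because the atomic [^'"]+ also consumes / and * and cannot give characters back, a comment sitting in the middle of unquoted text (for example "a /*c*/ b") is not matched. Treat it as an illustration of how atomic groups stop the backtracking, and test against your own inputs before relying on it for detection.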
I recommend that you read Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...).
I think it's because of this bit:
(?:[^'\"][^'\"]*|\"[^\"]*\"|'[^']*')*
Removing the second and third alternatives gives you:
(?:[^'\"][^'\"]*)*
or:
(?:[^'\"]+)*
Repeated repeats can take a long time.
For detecting /* ... */ comments I would suggest code like this:
String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" /*a comment\n\n*/ SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
Pattern pt = Pattern.compile("\"[^\"]*\"|'[^']*'|(/\\*.*?\\*/)",
Pattern.MULTILINE | Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
boolean found = false;
while (matcher.find()) {
if (matcher.group(1) != null) {
found = true;
break;
}
}
if (found)
System.out.println("Found Comment: [" + matcher.group(1) + ']');
else
System.out.println("Didn't find Comment");
For the above string it prints:
Found Comment: [/*a comment
*/]
But if I change the input string to:
String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" '/*a comment\n\n*/' SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
or
String str = "\n\n\n\n\n\nUPDATE \"$SCHEMA\" \"/*a comment\n\n*/\" SET \"VERSION\" = 12 WHERE NAME = 'SOMENAMEVALUE';";
Output is:
Didn't find Comment

How to split this string using Java Regular Expressions

I want to split the string
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
to
name
employeeno
dob
joindate
I wrote the following Java code for this, but it prints only "name"; the other matches are not printed.
String fields = "name[Employee Name], employeeno[Employee No], dob[Date of Birth], joindate[Date of Joining]";
Pattern pattern = Pattern.compile("\\[.+\\]+?,?\\s*" );
String[] split = pattern.split(fields);
for (String string : split) {
System.out.println(string);
}
What am I doing wrong here?
Thank you
This part:
\\[.+\\]
matches the first [, the .+ then gobbles up the entire string (if no line breaks are in the string) and then the \\] will match the last ].
You need to make the .+ reluctant by placing a ? after it:
Pattern pattern = Pattern.compile("\\[.+?\\]+?,?\\s*");
And shouldn't \\]+? just be \\] ?
The error is that you are matching greedily. You can change it to a non-greedy match:
Pattern.compile("\\[.+?\\],?\\s*")
                      ^
There's an online regular expression tester at http://gskinner.com/RegExr/?2sa45 that will help you a lot when you try to understand regular expressions and how they are applied to a given input.
Would it be better to use negated character classes to match the square brackets? \[(\w+\s)+\w+[^\]]\]
See also a good example: how does using a negated character class work internally (without backtracking)?
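Both suggestions, the reluctant quantifier and a negated character class, can be tried on the input from the question. This is a sketch; the class and method names are hypothetical, and the negated-class pattern \[[^\]]*\] is a simpler form than the one in the comment above.

```java
import java.util.regex.Pattern;

final class SplitFields {
    // Reluctant quantifier: .+? stops at the first closing bracket.
    private static final Pattern LAZY = Pattern.compile("\\[.+?\\],?\\s*");

    // Negated character class: [^\]]* can never run past a closing bracket.
    private static final Pattern NEGATED = Pattern.compile("\\[[^\\]]*\\],?\\s*");

    static String[] namesLazy(String fields) {
        return LAZY.split(fields);
    }

    static String[] namesNegated(String fields) {
        return NEGATED.split(fields);
    }

    public static void main(String[] args) {
        String fields = "name[Employee Name], employeeno[Employee No], "
            + "dob[Date of Birth], joindate[Date of Joining]";
        // Each call prints: name,employeeno,dob,joindate
        System.out.println(String.join(",", namesLazy(fields)));
        System.out.println(String.join(",", namesNegated(fields)));
    }
}
```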
