Java Regex Subexpressions

I'm trying to create a regex pattern to match a specific string and return true if the string matches the pattern and false if it doesn't. Here are the conditions:
Must start with [ and end with ]
Each item inside the brackets has to be separated by commas
Each comma-separated item has to follow this regex pattern:
^[A-Za-z][A-Za-z0-9_]*$
How can I make one regex that checks for all these conditions?

Enclose the part that can repeat in a group:
\[[A-Za-z][A-Za-z0-9_]*(?:,[A-Za-z][A-Za-z0-9_]*)*\]
This is the raw pattern as the regex engine should see it. Escape special characters as your language requires.

In Java, \w without the Pattern.UNICODE_CHARACTER_CLASS flag actually matches the same as [a-zA-Z0-9_]. So, I'd use
String pat = "\\[[a-zA-Z]\\w*(?:,[a-zA-Z]\\w*)*]";
Use it with String#matches, or you will have to add ^ (or \\A) at the beginning and $ (or \\z) at the end.
String pat = "\\[[a-zA-Z]\\w*(?:,[a-zA-Z]\\w*)*]";
System.out.println("[c1,T4,yu5]".matches(pat)); // true
Pattern explanation:
\\[ - a literal [
[a-zA-Z] - an English letter (same as \\p{Alpha})
\\w* - zero or more characters from [a-zA-Z0-9_] set
(?: - start of the non-capturing group matching...
, - a comma
[a-zA-Z]\\w* - see above
)* - ... zero or more times
] - a literal ] (does not require escaping outside of the character class to be treated literally).
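As a minimal sketch of how the pattern above could be wrapped for reuse, the class below compiles it once and checks a few inputs. The class and method names are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class BracketListValidator {
    // Compiled once. \w is ASCII-only here because UNICODE_CHARACTER_CLASS is not set.
    private static final Pattern LIST =
            Pattern.compile("\\[[a-zA-Z]\\w*(?:,[a-zA-Z]\\w*)*]");

    // True only if the whole input is a bracketed, comma-separated list of identifiers.
    static boolean isValid(String input) {
        return LIST.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("[c1,T4,yu5]")); // true
        System.out.println(isValid("[c1,,yu5]"));   // false: empty item
        System.out.println(isValid("[1c,T4]"));     // false: item starts with a digit
        System.out.println(isValid("[]"));          // false: at least one item is required
    }
}
```

Matcher#matches anchors the match at both ends, just like String#matches, so no ^ or $ is needed.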

Related

Regex for matching a character later in the string if a certain character is present before

Let's say I have the following string
['json.key']
I want a regex pattern that will match the entire string because it contains the matching closing '] to the opening ['.
But sometimes the [' and '] don't have to exist, and it should be okay too.
jsonKey
But I don't want strings like these to match
['jsonKey
jsonKey']
Because they are missing the matching [' and '].
The current regex pattern I have for this is
(\[')?[\w-]+('])?
But this doesn't quite work because it lets the two last cases pass.
I need a regex pattern for Java and JavaScript code. They are separate modules, so the patterns can differ.
In Java or JavaScript you can use alternation and lookarounds like this:
(?<!\S)(?:\['[\w-]+']|[\w-]+)(?!\S)
RegEx Details:
(?<!\S): Assert that previous char is not a non-whitespace
(?:: Start non-capture group
\['[\w-]+']: Match [' followed by 1+ word chars or hyphens, then ']
|: OR
[\w-]+: Match 1+ of word char or hyphen
): End non-capture group
(?!\S): Assert that next char is not a non-whitespace
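If the goal in Java is to validate a whole string rather than to find keys inside a larger text, the lookarounds can probably be dropped, because Matcher#matches already requires the entire input to match. A minimal sketch under that assumption (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

class JsonKeyValidator {
    // Either ['key'] with both delimiters present, or a bare key with neither.
    private static final Pattern KEY = Pattern.compile("\\['[\\w-]+']|[\\w-]+");

    static boolean isValid(String s) {
        return KEY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("['jsonKey']")); // true
        System.out.println(isValid("jsonKey"));     // true
        System.out.println(isValid("['jsonKey"));   // false: closing '] is missing
        System.out.println(isValid("jsonKey']"));   // false: opening [' is missing
    }
}
```

Note that [\w-] does not include a dot, so a key such as ['json.key'] is rejected by this pattern as written. Add . to the character class if dotted keys should pass.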

Java Regex Troubles

I have a string that needs to be extracted using regex. Preferably only a single regex is used, because it runs in a loop with 9 pre-existing regexes (i.e., so I can just add it to the ArrayList of available regexes).
The pattern of strings will always be
Between 4 and 8 characters from A-Z0-9, followed by either
[A-Z]{1} or [A-Z0-9]{2} or another [A-Z0-9]{4,8}
For example:
“A1B1C1 ABCD E FGHI JK X0Y0Z0”
I’d want this to return four matches.
A1B1C1 & ABCD E & FGHI JK & X0Y0Z0
I've been trying to match the first part of {4,8} characters, followed by a non-greedy match for {1,2}. For example:
[A-Z0-9]{4,8}(\\s{1}[A-Z0-9]{1,2})*? && [A-Z0-9]{4,8}(\\s{1}[A-Z]{1}|\\s{1}[A-Z0-9]{2})*?
But this never returns more than the first {4,8} characters.
You could use an optional part with a word boundary and an alternation to match either [A-Z0-9]{2} or [A-Z]
\b[A-Z0-9]{4,8}(?:\h+(?:[A-Z0-9]{2}|[A-Z]))?\b
\b Word boundary
[A-Z0-9]{4,8} Match 4 - 8 times A-Z0-9
(?: Non capture group
\h+ Match 1+ horizontal whitespace chars
(?:[A-Z0-9]{2}|[A-Z]) Match either 2 x A-Z0-9 or 1 x A-Z
)? Close non capture group and make it optional
\b Word boundary
In Java
String regex = "\\b[A-Z0-9]{4,8}(?:\\h+(?:[A-Z0-9]{2}|[A-Z]))?\\b";
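To collect every match rather than test a single one, the pattern can be run in a Matcher#find loop. The sketch below uses the sample string from the question; the class and method names are made up for illustration, and \h assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodeExtractor {
    // \h (horizontal whitespace) is available in java.util.regex since Java 8.
    private static final Pattern CODE =
            Pattern.compile("\\b[A-Z0-9]{4,8}(?:\\h+(?:[A-Z0-9]{2}|[A-Z]))?\\b");

    // Returns every non-overlapping match, scanning left to right.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CODE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [A1B1C1, ABCD E, FGHI JK, X0Y0Z0]
        System.out.println(extract("A1B1C1 ABCD E FGHI JK X0Y0Z0"));
    }
}
```

The trailing \b is what stops "A1B1C1" from swallowing the "A" or "AB" of the following "ABCD": after either of those there is no word boundary, so the optional group is skipped.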

Split Comma delimited String excluding those in brackets within the brackets

I have the following String to be split:
Given String:
[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]
Expected Results: (3 elements)
[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]
I have tried the following regular expressions to parse/split the above string:
",(?![^[]*[]])" or ",(?=(((?!]).)*\[)|[^\[\]]*$)"
but still I cannot achieve the expected results, but rather it gives me the following results (6 elements) instead:
[PSR__123456_A
[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR__123456_A
[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR_Net__123456_A
[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]
Is there a way to do this in Java (RegEx) without splitting the String character by character?
If you want to match a comma only when what follows it is an opening square bracket, a nested opening square bracket, and then two closing square brackets, you might use:
,(?=\[[^[]*\[[^[]*\][^]]*\])
In Java:
String regex = ",(?=\\[[^\\[]*\\[[^\\[]*\\][^]]*\\])";
That will match:
, Match comma
(?= Positive lookahead
\[[^[]*\[[^[]*\][^]]*\] matches:
\[ Match [
[^[]* Match 0+ characters other than [
\[ Match [
[^[]* Match 0+ characters other than [
\] Match ]
[^]]* Match 0+ characters other than ]
\] Match ]
) Close positive lookahead
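A short sketch of how the lookahead pattern could be passed to String#split, using the input from the question. The class and method names are hypothetical; the regex is the Java string shown above.

```java
class NestedBracketSplit {
    // Comma followed by "[ ... [ ... ] ... ]", i.e. the start of a top-level element.
    private static final String TOP_LEVEL_COMMA =
            ",(?=\\[[^\\[]*\\[[^\\[]*\\][^]]*\\])";

    static String[] split(String input) {
        return input.split(TOP_LEVEL_COMMA);
    }

    public static void main(String[] args) {
        String input =
                "[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],"
              + "[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],"
              + "[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]";
        // Prints the three top-level elements, one per line.
        for (String part : split(input)) {
            System.out.println(part);
        }
    }
}
```

Because the comma is the only consumed character and the rest sits inside a lookahead, the brackets stay attached to the elements after the split.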
Assuming that your first elements start with [PSR, then you can use a regex with positive lookahead like this:
,(?=\[PSR)
Replacing each match with \n puts every element on its own line.
Update: as Manish described in his comment, you can actually match ],[ and use ]\n[ as the replacement string.
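The replace-then-split idea from the update could look like the sketch below. It assumes that the sequence ],[ only ever occurs between top-level elements and that the data contains no newlines; the shortened input and the names are made up for illustration.

```java
class ReplaceThenSplit {
    // Turn each "],[" boundary into "]\n[" and then split on the newline.
    // Assumes "],[" never appears inside an element and the input has no newlines.
    static String[] split(String input) {
        return input.replace("],[", "]\n[").split("\n");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical input with the same nesting shape as the question's data.
        String input = "[a,[x=1,y=2],,b],[c,[x=3],,d]";
        for (String part : split(input)) {
            System.out.println(part);
        }
        // prints:
        // [a,[x=1,y=2],,b]
        // [c,[x=3],,d]
    }
}
```

String#replace works on literal text, so no regex escaping is needed for the brackets here.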

How to restrict occurrence of a character in regex?

I want to check if a string consists of letters and digits only, and allow a - separator:
^[\w\d-]*$
Valid: TEST-TEST123
Now I want to check that the separator occurs only once at a time. Thus the following examples should be invalid:
Invalid: TEST--TEST, TEST------TEST, TEST-TEST--TEST.
Question: how can I restrict the repeated occurrence of a character?
You may use
^(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*)?$
Or, in Java, you may use an alphanumeric \p{Alnum} character class to denote letters and digits:
^(?:\p{Alnum}+(?:-\p{Alnum}+)*)?$
Details
^ - start of the string
(?: - start of an optional non-capturing group (it will ensure the pattern matches an empty string, if you do not need it, remove this group!)
\p{Alnum}+ - 1 or more letters or digits
(?:-\p{Alnum}+)* - zero or more repetitions of
- - a hyphen
\p{Alnum}+ - 1 or more letters or digits
)? - end of the optional non-capturing group
$ - end of string.
In code, you do not need the ^ and $ anchors if you use the pattern in the matches method since it anchors the match by default:
boolean valid = s.matches("(?:\\p{Alnum}+(?:-\\p{Alnum}+)*)?");
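A minimal sketch that runs the pattern against the valid and invalid samples from the question. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class HyphenSeparated {
    // Alphanumeric runs joined by single hyphens; the outer optional group also allows "".
    private static final Pattern VALID =
            Pattern.compile("(?:\\p{Alnum}+(?:-\\p{Alnum}+)*)?");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("TEST-TEST123"));    // true
        System.out.println(isValid("TEST--TEST"));      // false
        System.out.println(isValid("TEST------TEST"));  // false
        System.out.println(isValid("TEST-TEST--TEST")); // false
        System.out.println(isValid(""));                // true: empty input is allowed
        System.out.println(isValid("TEST-"));           // false: trailing hyphen
    }
}
```

Every hyphen must be followed by at least one letter or digit, which is what rules out doubled, leading, and trailing hyphens.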

Regular expression match fails if only whitespace after the - character

I am working on a regular expression where the pattern is:
1.0.0[ - optional description]/1.0.0.0[ - optional description].txt
The [ - optional description] part is of course, optional. So some possible VALID values are
1.0.0/1.0.0.0.txt
1.0.0/1.0.0.0 - xyz.txt
1.0.0 - abc/1.0.0.0 - xyz.txt
1.0.0 - abc/1.0.0.0.txt
To be a little more robust in the pattern matching, I'd like to match zero or more spaces before and after the "-" character. So all these would be valid too.
1.0.0 - abc/1.0.0.0 - xyz.txt
1.0.0-abc/1.0.0.0-xyz.txt
1.0.0 -abc/1.0.0.0- xyz.txt
To do this matching, I have the following regular expression (Java code):
String part1 = "((\\d+.{1}\\d+.{1}\\d+)(\\s*-\\s*(.+))?)";
String part2 = "((\\d+.{1}\\d+.{1}\\d+.{1}\\d+)(\\s*-\\s*(.+))?\\.txt)";
pattern = Pattern.compile(part1+ "/" + part2);
So far this regular expression is working well. But while unit testing I found a case I can't quite figure out yet: the string contains a "-" character surrounded by 1 or more spaces, but there is no description after the "-" character. This would look like:
1.0.0 - /1.0.0.0.txt
1.0.0- /1.0.0.0-xyz.txt
In these cases, I want the pattern match to FAIL. But with my current regular expression the match succeeds. I think what I want is if there is a "-" character surrounded by any number of spaces like " - " then there must also be at least 1 non-space character following it. But I can't quite figure out the regex for this.
Thanks!
Something like,
^\d+\.\d+\.\d+(?:\s*-\s*\w+)?\/\d+\.\d+\.\d+\.\d+(?:\s*-\s*\w+)?\.txt$
Or you can combine the \.\d+ repetitions as
^\d+(?:\.\d+){2}(?:\s*-\s*\w+)?\/\d+(?:\.\d+){3}(?:\s*-\s*\w+)?\.txt$
Changes
.{1} When you want to match something exactly once, there is no need for {1}. It's implicit
(?:\s*-\s*\w+) Matches zero or more whitespace characters (\s*) followed by -, zero or more whitespace characters again, and then \w+, a description of at least 1 word character
The ? at the end of this patterns makes this optional.
This same pattern is repeated again at the end to match the second part.
^ Anchors the regex at the start of the string.
$ Anchors the regex at the end of the string. These two are necessary so that there is nothing other in the string.
Don't group the patterns using () unless you need to capture them, since each capturing group costs extra memory. Use (?:..) if you want to group patterns without capturing them.
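The combined pattern can be checked against the samples from the question with a sketch like this one. The class and method names are made up for illustration, the dot before txt is escaped, and ^ and $ are left out because Matcher#matches already anchors both ends.

```java
import java.util.regex.Pattern;

class VersionPathValidator {
    // version[ - description]/version[ - description].txt, descriptions limited to \w+.
    private static final Pattern PATH = Pattern.compile(
            "\\d+(?:\\.\\d+){2}(?:\\s*-\\s*\\w+)?/\\d+(?:\\.\\d+){3}(?:\\s*-\\s*\\w+)?\\.txt");

    static boolean isValid(String path) {
        return PATH.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("1.0.0/1.0.0.0.txt"));             // true
        System.out.println(isValid("1.0.0 - abc/1.0.0.0 - xyz.txt")); // true
        System.out.println(isValid("1.0.0 -abc/1.0.0.0- xyz.txt"));   // true
        System.out.println(isValid("1.0.0 - /1.0.0.0.txt"));          // false: no description
        System.out.println(isValid("1.0.0- /1.0.0.0-xyz.txt"));       // false: no description
    }
}
```

Since \w+ needs at least one word character after the hyphen, a dangling " - " can only be skipped together with the whole optional group, and then the leftover spaces and hyphen make the match fail.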
In the group that matches the optional part, you need to replace .+ with \\S+ where \S means any non-whitespace character. This forces the optional part to contain at least one non-whitespace character in order to match the pattern:
String part1 = "((\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?)";
String part2 = "((\\d+\\.\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?\\.txt)";
Also note that .{1} (which is the same as just .) matches any character. From the examples, you want to match a dot, so it should be replaced with \.
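To illustrate both points (\S+ for the description and \. for the literal dots), here is a hedged sketch that keeps the question's capturing groups and reads the two descriptions from groups 4 and 8. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DescriptionExtractor {
    // Groups 1-4 belong to part 1, groups 5-8 to part 2; 4 and 8 are the descriptions.
    private static final String PART1 = "((\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?)";
    private static final String PART2 = "((\\d+\\.\\d+\\.\\d+\\.\\d+)(\\s*-\\s*(\\S+))?\\.txt)";
    private static final Pattern PATH = Pattern.compile(PART1 + "/" + PART2);

    // Returns {description1, description2}; an entry is null when that description is absent.
    // Returns null when the input does not match at all.
    static String[] descriptions(String path) {
        Matcher m = PATH.matcher(path);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(4), m.group(8) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(descriptions("1.0.0 - abc/1.0.0.0 - xyz.txt"))); // [abc, xyz]
        System.out.println(Arrays.toString(descriptions("1.0.0/1.0.0.0.txt")));             // [null, null]
        System.out.println(descriptions("1.0.0 - /1.0.0.0.txt") == null);                   // true
        // An unescaped dot accepts any character, which is why \. is needed:
        System.out.println("1x0x0".matches("\\d+.\\d+.\\d+"));     // true
        System.out.println("1x0x0".matches("\\d+\\.\\d+\\.\\d+")); // false
    }
}
```

One limitation of \S+ is that a description containing spaces, such as "abc def", no longer matches.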
Something like
^\d+\.\d+\.\d+(?:\s*-\s*[^\/\s]+)?\/\d+\.\d+\.\d+\.\d+?(?:\s*-\s*[^.\s]+)?\.\w+$
