Split Comma delimited String excluding those in brackets within the brackets - java

I have the following String to be split:
Given String:
[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]
Expected Results: (3 elements)
[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]
I have tried the following regular expressions to parse/split the above string:
",(?![^[]*[]])" or ",(?=(((?!]).)*\[)|[^\[\]]*$)"
but I still cannot achieve the expected results; instead I get the following (6 elements):
[PSR__123456_A
[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR__123456_A
[AgrID=123456,PoolID=A],,Auto,Bank,0,0]
[PSR_Net__123456_A
[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]
Is there a way to do this in Java (RegEx) without splitting the String character by character?

If you want to match a comma only when what follows it is an element containing one nested pair of brackets (two opening and two closing square brackets in total), you might use:
,(?=\[[^[]*\[[^[]*\][^]]*\])
In Java:
String regex = ",(?=\\[[^\\[]*\\[[^\\[]*\\][^]]*\\])";
See the Regex demo | Java demo
That will match:
, Match comma
(?= Positive lookahead
\[[^[]*\[[^[]*\][^]]*\] matches:
\[ Match [
[^[]* Negated character class not matching [
\[ Match [
[^[]* Negated character class not matching [
\] Match ]
[^]]* Negated character class not matching ]
\] Match ]
) Close positive lookahead
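As a minimal sketch of how this lookahead behaves with String#split on the input from the question (the class and method names are illustrative, and the closing bracket inside the character class is escaped here only for readability):

```java
class NestedBracketSplit {
    // A comma is a split point only if it is followed by
    // '[' ... '[' ... ']' ... ']' as described in the breakdown above.
    static final String REGEX = ",(?=\\[[^\\[]*\\[[^\\[]*\\][^\\]]*\\])";

    static String[] splitElements(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        String input = "[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],"
                + "[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],"
                + "[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]";
        // Prints one top-level element per line.
        for (String element : splitElements(input)) {
            System.out.println(element);
        }
    }
}
```

The lookahead relies on each element containing exactly one nested bracket pair; input with deeper or missing nesting would need a different approach.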

Assuming that your first elements start with [PSR, then you can use a regex with positive lookahead like this:
,(?=\[PSR)
Working demo
With \n as replacement string
Update: as Manish described in his comment, you can actually use ],[ with ]\n[ as replacement string
Working demo
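Both ideas can also be used directly with String#split instead of a replacement. The sketch below is only an illustration (class and method names are made up here); the second variant assumes that the sequence ],[ never occurs inside an element, which holds for the sample input:

```java
class BoundarySplit {
    // Split on a comma that is followed by "[PSR".
    static String[] splitOnPsr(String input) {
        return input.split(",(?=\\[PSR)");
    }

    // Split on a comma that sits between "]" and "[" (the "],[" boundary).
    static String[] splitOnBracketBoundary(String input) {
        return input.split("(?<=\\]),(?=\\[)");
    }

    public static void main(String[] args) {
        String input = "[PSR__123456_A,[AgrID=123456,PoolID=A],,Auto,Bank,0,0],"
                + "[PSR_Net__123456_A,[AgrID=123456,PoolID=A],,Suppress_Collateral,Bank,0,0]";
        for (String element : splitOnBracketBoundary(input)) {
            System.out.println(element);
        }
    }
}
```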

Related

How to extract and replace a String with specific format?

I have input String like;
(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)
What I want to do is find all words starting with "rm" and replace them with remove function.
(remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
I am trying to use replaceAll function but I don't know how to extract parts after "rm" literal.
statement.replaceAll("\\(rm*.,", "remove($1)");
Is there any way to get these parts?
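You have not captured any substring with a capturing group, so there is no Group 1 for $1 to refer to (in Java this throws an IndexOutOfBoundsException rather than inserting null).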
You may use
.replaceAll("\\brm(\\w*)", "remove($1)")
See the regex demo
Details
\b - a word boundary (to start matching only at the start of a word)
rm - a literal part
(\w*) - Group 1: 0+ word chars (letters, digits or underscores)
The $1 in the replacement pattern stands for Group 1 value.
If you mean to match any chars other than a comma and whitespace after rm, use "\\brm([^\\s,]*)", see this regex demo.
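A small self-contained sketch of both variants (the class and method names are illustrative only). The two patterns give the same result on the sample input and differ only when non-word characters such as a hyphen follow rm:

```java
class RmReplace {
    // Group 1 captures the word characters that follow "rm".
    static String wrapWordChars(String s) {
        return s.replaceAll("\\brm(\\w*)", "remove($1)");
    }

    // Group 1 captures everything up to the next comma or whitespace.
    static String wrapUntilCommaOrSpace(String s) {
        return s.replaceAll("\\brm([^\\s,]*)", "remove($1)");
    }

    public static void main(String[] args) {
        String statement =
                "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
        System.out.println(wrapWordChars(statement));
        System.out.println(wrapWordChars("rmab-cd"));          // remove(ab)-cd
        System.out.println(wrapUntilCommaOrSpace("rmab-cd"));  // remove(ab-cd)
    }
}
```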
Use "Replace" with an empty string.
E.g.:
string str = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
Console.WriteLine(str.Replace("rm", ""));
Output : (01ADS21212, 'adfffddd', Adssssss, '1231232131', 2321312322)

Java Regex Subexpressions

I'm trying to create a regex pattern to match a specific string and return true if the string matches the pattern and false if it doesn't. Here are the conditions:
Must start with [ and end with ]
Each item inside the brackets has to be separated by commas
Each item separated by commas has to follow this regex pattern:
^[A-Za-z][A-Za-z0-9_]*$
How can I make one regex that checks for all these conditions?
Put the comma-separated part in a group that can repeat:
\[[A-Za-z][A-Za-z0-9_]*(?:,[A-Za-z][A-Za-z0-9_]*)*\]
This is the pattern as the regex engine should see it. Escape the special characters as your language requires.
In Java, \w without the Pattern.UNICODE_CHARACTER_CLASS flag actually matches the same as [a-zA-Z0-9_]. So, I'd use
String pat = "\\[[a-zA-Z]\\w*(?:,[a-zA-Z]\\w*)*]";
See the IDEONE demo. Use with String#matches, or you will have to add ^ (or \\A) at the beginning and $ (or \\z) at the end.
String pat = "\\[[a-zA-Z]\\w*(?:,[a-zA-Z]\\w*)*]";
System.out.println("[c1,T4,yu5]".matches(pat)); // TRUE
Pattern explanation:
\\[ - a literal [
[a-zA-Z] - an English letter (same as \\p{Alpha})
\\w* - zero or more characters from [a-zA-Z0-9_] set
(?: - start of the non-capturing group matching...
, - a comma
[a-zA-Z]\\w* - see above
)* - ... zero or more times
] - a literal ] (does not require escaping outside of the character class to be treated literally).
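To see which inputs this pattern accepts and rejects, here is a minimal sketch (the class and method names are hypothetical). String#matches requires the whole string to match, so no anchors are needed:

```java
class BracketListValidator {
    // '[' + identifier + zero or more ",identifier" + ']'
    static final String PAT = "\\[[a-zA-Z]\\w*(?:,[a-zA-Z]\\w*)*]";

    static boolean isValidList(String s) {
        return s.matches(PAT);
    }

    public static void main(String[] args) {
        String[] samples = {"[c1,T4,yu5]", "[a]", "[]", "[1c]", "[a,]", "[a, b]"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValidList(sample));
        }
    }
}
```

Only the first two samples are accepted: an empty list, an item starting with a digit, a trailing comma, and a space after a comma all fail.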

How to replace dashes with underscores within the square brackets using regex Java

I am trying to replace dashes within the square brackets with underscores but it replaces all dashes with underscores in string.
For example, I want to replace
"[a]-[a-gamma]"
with
"[a]-[a_gamma]"
but it replaces all dashes from the string with underscores.
You can use
String n="[a]-[a-gamma]";
System.out.println(n.replaceAll("-(?=[^\\[\\]]*\\])", "_"));
As for the regex itself, I match the - symbol only if it is followed by non-[s and non-]s until the engine finds the ]. Then, we are "inside" the []s. There can be a situation when this is not quite true (4th hyphen in [a-z]-[a-z] - ] [a-z]), but I hope it is not your case.
IDEONE Demo
Output:
[a]-[a_gamma]
Use a negative lookahead:
str = str.replaceAll("-(?![^\\]]*\\[)", "_");
The regex matches dashes whose next square bracket character is not an opening square bracket.
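The positive and negative lookahead variants agree on the example from the question but not on every input. A short sketch comparing them (the class and method names are illustrative):

```java
class DashInBrackets {
    // Dash followed by non-bracket characters and then ']'.
    static String withPositiveLookahead(String s) {
        return s.replaceAll("-(?=[^\\[\\]]*\\])", "_");
    }

    // Dash whose next square bracket (if any) is not '['.
    static String withNegativeLookahead(String s) {
        return s.replaceAll("-(?![^\\]]*\\[)", "_");
    }

    public static void main(String[] args) {
        System.out.println(withPositiveLookahead("[a]-[a-gamma]")); // [a]-[a_gamma]
        System.out.println(withNegativeLookahead("[a]-[a-gamma]")); // [a]-[a_gamma]
        // A dash after the last bracket has no bracket to its right:
        System.out.println(withPositiveLookahead("[a-b]-c"));       // [a_b]-c
        System.out.println(withNegativeLookahead("[a-b]-c"));       // [a_b]_c
    }
}
```

The negative lookahead version also replaces dashes that appear after the last closing bracket, so the positive lookahead is the safer choice when text can follow the final ].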
-(?=[^\\[]*\\])
You can use this. See demo.
https://regex101.com/r/bN8dL3/6
If your brackets are balanced (or if an unclosed bracket is considered open until the end of the string), you can use this approach, which needs few steps to find a match:
pattern:
((?:\\G(?!\\A)|[^\\[]*\\[)[^\\]-]*)-
replacement:
$1_
demo
pattern details:
( # open the capture group 1
(?: # open a non capturing group for the 2 possible beginnings
\\G (?!\\A) # this one succeeds immediately after the last match
|
[^\\[]* \\[ # this one reaches the first opening bracket
# (so it is the first match)
)
[^\\]-]* # all that is not a closing bracket or a dash
) # close the capture group
- # the dash
The \G anchor marks the position after the last match. But at the beginning (since there is no previous match yet), it matches the start of the string by default. This is why I added (?!\A): it makes that branch fail at the start of the string.
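A minimal Java sketch of the \G-based pattern, assuming balanced brackets as stated above (the class and method names are made up for illustration):

```java
class ContiguousDashReplace {
    // Group 1 holds everything from the previous match (or from the next
    // opening bracket) up to the dash; the dash itself is replaced by '_'.
    static final String REGEX = "((?:\\G(?!\\A)|[^\\[]*\\[)[^\\]-]*)-";

    static String replaceDashesInBrackets(String s) {
        return s.replaceAll(REGEX, "$1_");
    }

    public static void main(String[] args) {
        System.out.println(replaceDashesInBrackets("[a]-[a-gamma]")); // [a]-[a_gamma]
        System.out.println(replaceDashesInBrackets("[a-b-c]-d"));     // [a_b_c]-d
    }
}
```

Each match either continues right where the previous one ended (still inside the same bracket pair) or jumps forward to the next opening bracket, which is why dashes outside brackets are left alone.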
How about this?
/\[[^\]]*?(-)[^\[]*?\]/g
Match group extracted:
"[a]-[a-gamma] - [[ - - [123-567567]]]"
^ ^
Explanation available here: https://regex101.com/r/oC2xE0/1

What is the meaning of [...] regex?

I am new to regex. Going through a tutorial, I found that the regex [...] is described as "Matches any single character in brackets." So I tried
System.out.println(Pattern.matches("[...]","[l]"));
I also tried escaping brackets
System.out.println(Pattern.matches("[...]","\\[l\\]"));
But it gives me false. I expected true because l is inside brackets.
It would be helpful if anybody could clear up my doubts.
Characters that are inside [ and ] (called a character class) are treated as a set of characters to choose from, except leading ^ which negates the result and - which means range (if it's between two characters). Examples:
[-123] matches -, 1, 2 or 3
[1-3] matches a single digit in the range 1 to 3
[^1-3] matches any character except any of the digits in the range 1 to 3
. matches any character
[.] matches the dot .
If you want to match the string [l] you should change your regex to:
System.out.println(Pattern.matches("...", "[l]"));
Now it prints true.
The regex [...] is equivalent to the regexes \. and [.].
The tutorial is a little misleading, it says:
[...] Matches any single character in brackets.
However, what it means is that the regex will match a single character against any of the characters inside the brackets. The ... means "insert characters you want to match here". So you need to replace the ... with the characters that you want to match against.
For example, [AP]M will match against "AM" and "PM".
If your regex is literally [...] then it will match against a literal dot. Note there is no point repeating characters inside the brackets.
The tutorial is saying:
Matches any single character in brackets.
It means you replace ... with the character or characters you want to allow, for example [l]
These will print true:
System.out.println(Pattern.matches("[l]","l"));
System.out.println(Pattern.matches("[.]","."));
System.out.println(Pattern.matches("[.]*","."));
System.out.println(Pattern.matches("[.]*","......"));
System.out.println(Pattern.matches("[.]+","......"));
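To tie the answers together, here is a short sketch (the class and helper names are illustrative) contrasting a character class with escaped, literal brackets, which is what is needed to match the string [l] from the question:

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Thin wrapper so the checks below read compactly.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[...]", "."));     // true: class containing only '.'
        System.out.println(matches("[...]", "[l]"));   // false: class matches one char
        System.out.println(matches("[AP]M", "PM"));    // true: 'A' or 'P', then 'M'
        System.out.println(matches("\\[.\\]", "[l]")); // true: literal '[', any char, literal ']'
        System.out.println(matches("\\[l\\]", "[l]")); // true: literal "[l]"
    }
}
```

Note that the brackets have to be escaped in the pattern (the first argument), not in the input string.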

Java regex to preserve ngrams in square brackets

I am a bit of a newbie with Java regex so I wonder if anyone can help where I need a regex to split text based on ngrams. So if I have text like this:
dyson [salisbury matheson beaumont] clarke [carstairs morden] vaughan
To return the following ngrams:
Unigram: dyson
Trigram: salisbury matheson beaumont
Unigram: clarke
Bigram: carstairs morden
Unigram: vaughan
The contents of the square brackets should be preserved as bigrams or trigrams.
The split would be based upon spaces outside the brackets.
That's pretty easy:
\w+|\[([\w\s]+)\]
Demo
Explanation:
\w+ matches a word (a series of alphanumeric characters or an underscore)
or: \[([\w\s]+)\]
\[ matches a [
[\w\s]+ matches a series of words and spaces, this is captured
\] matches a ]
If you have a capture it means you have something in brackets, else it means you have a single word. You can then apply the simple \w+ regex to the contents of the brackets to extract the words.
To use it in Java you have to escape the backslashes to pass them as-is to the regex engine:
String pattern = "\\w+|\\[([\\w\\s]+)\\]";
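A possible way to put the pattern to work with Matcher#find is sketched below. The class name, method names and the label scheme are assumptions made for this example, not part of the answer above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NgramExtractor {
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\[([\\w\\s]+)\\]");

    static List<String> extract(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Group 1 is set only when the bracketed alternative matched.
            String bracketed = m.group(1);
            String gram = (bracketed != null) ? bracketed.trim() : m.group();
            int n = gram.split("\\s+").length; // number of words in this n-gram
            result.add(label(n) + ": " + gram);
        }
        return result;
    }

    private static String label(int n) {
        switch (n) {
            case 1: return "Unigram";
            case 2: return "Bigram";
            case 3: return "Trigram";
            default: return n + "-gram";
        }
    }

    public static void main(String[] args) {
        String text = "dyson [salisbury matheson beaumont] clarke [carstairs morden] vaughan";
        extract(text).forEach(System.out::println);
    }
}
```

Using find rather than split avoids having to describe "spaces outside brackets" in a single regex: each match is already one n-gram.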
