high-level regular expression with not

high-level regular expression with not - java

Hi regular expression experts,
I have the following text
<[~UNKNOWN:a-z\.]> <[~UNKNOWN:A-Z\-0-9]> <[~UNKNOWN:A-Z\]a-z]
And the following reg expr
\[\~[^\[\~\]]*\]
It works fine for the 1st and 2nd group in the text but not for the 3rd one.
The 1st group is
[~UNKNOWN:a-z\.]
The 2nd is
[~UNKNOWN:A-Z\-0-9]
and the 3rd one is
[~UNKNOWN:A-Z\]a-z]
However the reg exp finds the following text
[~UNKNOWN:A-Z\]
I understand why, and I know that I have to add the following rule to the reg exp:
a match starts with the '[' and '~' characters and ends with ']', UNLESS there is a '\' in front of that ']'. So I need some kind of NOT expression, but I am not sure how to write it.
Could anybody please help?
Thanks,
V.

Why not simply match everything between the angle brackets?
<([^>]+)>

This should work (first line pattern, second line your pattern (ignore whitespace), third line my changes):
\[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
\[\~   [^\[\~\]]           *       \]
    (?:         |(?<=\\)\])  (?<!\\)
Your regex:
\[\~ # Literal characters [~
[^ # Character group, NONE of the following:
\[\~\] # [ or ~ or ]
]* # 0 or more of this character group
\] # Followed by ]
Your pattern in words: [~, everything in between, up to the next ], as long as there is no [ or ~ or ] in there.
My pattern, with only the relevant changes explained:
\[\~
(?: # Non capturing group
[^\[\~\]]
| # OR
(?<=\\)\] # ], preceded by \
)*
(?<!\\)\] # ], not preceded by \
In words: Same as yours, plus ] may be contained if it is preceded by \, and the closing ] may not be preceded by \
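Since the question is tagged Java, here is a minimal sketch of how the pattern above could be tried out there. The class name EscapedBracketDemo and the helper findTokens are made up for illustration; the only thing taken from the answer is the regex itself, with every backslash doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedBracketDemo {
    // Regex from the answer; each regex backslash is doubled in the Java literal.
    private static final Pattern TOKEN = Pattern.compile(
            "\\[\\~(?:[^\\[\\~\\]]|(?<=\\\\)\\])*(?<!\\\\)\\]");

    // Hypothetical helper: collects every [~...] token found in the text.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample text from the question (backslashes escaped for Java).
        String text = "<[~UNKNOWN:a-z\\.]> <[~UNKNOWN:A-Z\\-0-9]> <[~UNKNOWN:A-Z\\]a-z]";
        for (String token : findTokens(text)) {
            System.out.println(token);
        }
    }
}
```

With the sample text from the question, the third token should now come back whole, including the escaped \] in the middle.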

Related

match repeated blocks separated by ~

To match the following text:
text : SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES
It has this format:<code1>~<description1>\<code2>~<description2>\<code3>~<description3>....<codeN>~<descriptionN>
I used this regex: [A-Z0-9 ]+~[A-Z0-9 ]+(?:\\[A-Z0-9 ]+~[A-Z0-9 ]+)+
So:
case 1. SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES (Match: OK)
case 2. SS~B66\88~PRELIMINARY PAGES~HEADING PAGES (No Match: OK because I removed the code 'M01')
case 3. SS~B66~PRELIMINARY PAGES\M01~HEADING PAGES (No Match: OK because I removed the code '88')
More examples:
SS~B66\88~MEKLKE\M01~MOIIE
B~A310\0~PRELIM#INARY\00-00~HEADING
My problem is that <code> and <description> can contain any type of character. When I replaced my regex with
.+~.+(?:\\.+~.+)+
it also matches case 2 and case 3.
Thank you for your help.

Instead of using [A-Z0-9 ] which would not match all the allowed chars, or .+ which would match too much, you can use a negated character class [^~\\] matching any char except \ and ~ to set the boundaries for the matched parts.
^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$
^ Start of string
[^~]+~ Match any char other than ~, then match ~
[^~\\]+ Match 1+ chars other than ~ and \
(?: Non-capturing group
\\[^~]+~[^~\\]+ Match \, then 1+ chars other than ~, then ~, then 1+ chars other than ~ and \
)+ Close the group and repeat it 1 or more times, so at least one \ is required
$ End of string
Regex demo (The demo contains \n to not cross the newlines in the example data)
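To check the pattern against the three cases from the question, a small Java sketch could look like this. The class name TildeBlocksDemo and the method isValid are hypothetical; the regex is the one given above, with backslashes doubled for the Java string literal.

```java
import java.util.regex.Pattern;

public class TildeBlocksDemo {
    // Regex from the answer: <code>~<description> blocks separated by a backslash.
    private static final Pattern BLOCKS = Pattern.compile(
            "^[^~]+~[^~\\\\]+(?:\\\\[^~]+~[^~\\\\]+)+$");

    // Hypothetical helper: true if the whole string follows the block format.
    static boolean isValid(String s) {
        return BLOCKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] cases = {
            "SS~B66\\88~PRELIMINARY PAGES\\M01~HEADING PAGES", // case 1
            "SS~B66\\88~PRELIMINARY PAGES~HEADING PAGES",       // case 2
            "SS~B66~PRELIMINARY PAGES\\M01~HEADING PAGES"        // case 3
        };
        for (String c : cases) {
            System.out.println(isValid(c) + "  " + c);
        }
    }
}
```

Only case 1 should be accepted; in cases 2 and 3 a ~ shows up where the pattern expects either a \ or the end of the string.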

Regex: How to remove a substring that is bounded by certain characters?

I'm not sure what the appropriate regex expression would be for this:
String s = "[Don't remove] Don't remove [Remove | Don't remove]";
I want to remove everything in between [ and | but not [ and ]. So the output is:
"[Don't remove] Don't remove Don't remove]"
I tried doing this,
s = s.replaceAll("\\[.*?\\|", "");
but I end up getting something like this.
"Don't remove]"
Now I'm at a loss. I'm still new to regular expressions and any help would be greatly appreciated. Thanks!

Use a negated character class [^\[|]* that will not allow matching any other [ or | in between [ and |:
String s = "[Don't remove] Don't remove [Remove | Don't remove]";
s = s.replaceAll("\\[[^\\[|]*\\|", "");
System.out.println(s); // => [Don't remove] Don't remove Don't remove]
See a regex demo and an online Java demo.
Details
\\[ - a literal [
[^\\[|]* - a negated character class matching any 0+ chars other than a [ and |
\\| - a literal | symbol.

Get equations from string

I have a string in following pattern
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get following result from this.
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could anyone please suggest a pattern/RE for this?
What I have considered so far:
I could first split the string on AND/OR and then split the resulting strings on : and so on, but I am looking for a single RE/pattern that can do this.

You can use this regex with negated pattern to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
Breakup:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert there is AND/OR at the end or closing )
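A quick way to try this in Java is sketched below. The class name KeyValueDemo and the method extractPairs are invented for the example; the pattern is the quoted string from the answer, and the input is the sample from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValueDemo {
    // Quoted regex from the answer: :key:'value' followed by AND/OR or a closing ).
    private static final Pattern PAIR = Pattern.compile(
            ":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    // Hypothetical helper: returns every :key:'value' chunk in the expression.
    static List<String> extractPairs(String expr) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(expr);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String expr = "( var1=:key1:'any_value including space and 'quotes'' AND/OR "
                + "var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )";
        for (String pair : extractPairs(expr)) {
            System.out.println(pair);
        }
    }
}
```

The lazy .*? keeps extending past the inner quotes until it reaches a quote that is followed by AND/OR or the closing parenthesis, which is why the nested 'quotes' in the first value stays inside the match.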

I think this would work for your circumstances.
Unless you can actually parse the nested quotes, there is not much else you can do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+

Regex match repeating pattern only after string

Let a PropDefinition be a string of the form prop\d+ (true|false).
I have a string like:
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
I'd like to extract the bottom PropDefinitions only after the text 'sat', so the matches should be:
prop0 false
prop1 false
prop2 true
I originally tried /(prop\d (?:true|false))/s, but that obviously matches all PropDefinitions, and I couldn't make it match only the ones after the sat string.
I used Rubular for the example above because it was convenient, but I'm really looking for the most language-agnostic solution. If it matters, I'll most likely be using the regex in a Java application.

str =<<-Q
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
Q
p str[/^sat(.*)/m, 1].scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]

When you have patterns that are very different in nature as in this case (string after sat and selecting the specific patterns), it is usually better to express them in multiple regexes rather than trying to do it with a single regex.
s = <<_
((prop5 true))
sat
((prop0 false)
(prop1 false)
(prop2 true))
_
s.split(/^sat\s+/, 2).last.scan(/prop\d+ (?:true|false)/)
# => ["prop0 false", "prop1 false", "prop2 true"]

\s+[(]+\K(prop\d (?:true|false)(?=[)]))

If Ruby supports the \G anchor, this is one solution.
It looks nasty, but several things are going on:
1. It only allows a single level of nesting (one outer form containing many inner forms).
2. It will not match invalid forms that don't comply with '(prop\d true|false)'.
Without condition 2 it would be a lot easier, which is an indicator that a two-regex solution would do the same: first capture the outer form sat((..)..(..)..), then globally capture the inner forms (prop\d true|false).
It can be done in a single regex, though it is hard to look at (test case below in Perl).
# (?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))
(?:
(?! \A | sat \s* \( )
\G # Start match from end of last match
| # or,
sat \s* \( # Start form 'sat ('
)
[^()]* # This check section consumes invalid inner '(..)' forms
(?: # since we are looking specifically for '(prop\d true|false)'
\(
(?!
prop \d [ ]
(?: true | false )
\)
)
[^()]*
\)
[^()]*
)* # End section, do optionally many times
\(
( # (1 start), match inner form '(prop\d true|false)'
prop \d [ ]
(?: true | false )
) # (1 end)
\)
(?= # Look ahead for end form '(..)(..))'
(?:
[^()]*
\( [^()]* \)
)*
[^()]*
\)
)
Perl test case
$/ = undef;
$str = <DATA>;
while ($str =~ /(?:(?!\A|sat\s*\()\G|sat\s*\()[^()]*(?:\((?!prop\d[ ](?:true|false)\))[^()]*\)[^()]*)*\((prop\d[ ](?:true|false))\)(?=(?:[^()]*\([^()]*\))*[^()]*\))/g)
{
print "'$1'\n";
}
__DATA__
((prop10 true))
sat
((prop3 false)
(asdg)
(propa false)
(prop1 false)
(prop2 true)
)
((prop5 true))
Output >>
'prop3 false'
'prop1 false'
'prop2 true'

Part of the confusion has to do with SingleLine vs MultiLine matching. The patterns below work for me and return all matches in a single execution and without requiring a preliminary operation to split the string.
This one requires SingleLine mode to be specified separately (as with .NET's RegexOptions.Singleline):
(?<=sat.*)(prop\d (?:true|false))
This one specifies SingleLine mode inline which works with many, but not all, RegEx engines:
(?s)(?<=sat.*)(?-s)(prop\d (?:true|false))
You don't need to turn SingleLine mode off via the (?-s) but I think it is clearer in its intent.
The following pattern also toggles SingleLine mode inline, but uses a negative lookahead instead of a positive lookbehind. According to regular-expressions.info (be sure to select Ruby and Java from the drop-downs), the Ruby engine supports lookbehinds, positive or negative, only in some versions, and even then does not allow quantifiers inside them (also noted by @revo in a comment below). This pattern should work in Java, .NET, most likely Ruby, and others:
(prop\d (?:true|false))(?s)(?!.*sat)(?-s)
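As a rough check of the lookahead variant in Java, something like the following could be used. The class name LookaheadSatDemo and the method propsWithoutSatAhead are made up for the sketch; the pattern is the one above, copied as is (so it only covers single-digit prop numbers).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadSatDemo {
    // A PropDefinition that has no "sat" anywhere after it.
    // (?s) lets . cross newlines inside the lookahead; (?-s) switches that off again.
    private static final Pattern PROP = Pattern.compile(
            "(prop\\d (?:true|false))(?s)(?!.*sat)(?-s)");

    // Hypothetical helper: PropDefinitions not followed by "sat" later in the text.
    static List<String> propsWithoutSatAhead(String text) {
        List<String> props = new ArrayList<>();
        Matcher m = PROP.matcher(text);
        while (m.find()) {
            props.add(m.group(1));
        }
        return props;
    }

    public static void main(String[] args) {
        String text = "((prop5 true))\nsat\n((prop0 false)\n (prop1 false)\n (prop2 true))";
        System.out.println(propsWithoutSatAhead(text));
    }
}
```

One thing to keep in mind with this approach: if the text contains no "sat" at all, the lookahead never fails and every PropDefinition is returned, so it only approximates "after sat".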

/(?<=sat).*?(prop\d (true|false))/m
Match group 1 is what you want.
But I would really recommend splitting the string first. It's much easier.

Regular expression for no whitespaces on the first position

Example accepted:
This is a try!
And this is the second line!
Example not accepted:
this is a try with initial spaces
and this the second line
So, I need:
no string made only by whitespaces " "
no string where first char is whitespace
new lines are ok; only the first character cannot be a new line
I was using
^(?=\s*\S).*$
but that pattern can't allow new lines.

You can try this regex
^(?!\s*$|\s).*$
    ---- -- --
    |    |  |->matches everything!
    |    |->no string where first char is whitespace
    |->no string made only by whitespaces
You need to use single-line mode (Pattern.DOTALL in Java) so that . also matches newlines, and you need to use the matches method so that the whole string is checked.
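In Java, this whole-string check could be sketched as follows. The class name LeadingWhitespaceCheck and the method isAccepted are hypothetical; the regex is the one from this answer, compiled with Pattern.DOTALL and applied with matches().

```java
import java.util.regex.Pattern;

public class LeadingWhitespaceCheck {
    // Regex from the answer; DOTALL lets . match newlines so later lines are allowed.
    private static final Pattern VALID = Pattern.compile(
            "^(?!\\s*$|\\s).*$", Pattern.DOTALL);

    // Hypothetical helper: true if the string does not start with whitespace
    // and is not made up of whitespace only.
    static boolean isAccepted(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("This is a try!\nAnd this is the second line!"));
        System.out.println(isAccepted(" this is a try with initial spaces\nand this the second line"));
        System.out.println(isAccepted("   "));
    }
}
```

The first call should print true and the other two false, because the negative lookahead rejects any input whose first character is whitespace.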

"no string made only by whitespaces" is the same to "no string where first char is whitespace" as it also begins with white space.
You have to set Pattern.MULTILINE, which makes ^ and $ match at the beginning and end of each line, not only of the entire string:
"^\\S.+$"

I'm not a Java guy, but a solution in Python could look like this here:
In [1]: import re
In [2]: example_accepted = 'This is a try!\nAnd this is the second line!'
In [3]: example_not_accepted = ' This is a try with initial spaces\nand this the second line'
In [4]: pattern = re.compile(r"""
....: ^ # matches at the beginning of a string
....: \S # matches any non-whitespace character
....: .+ # matches one or more arbitrary characters
....: $ # matches at the end of a string
....: """,
....: flags=re.MULTILINE|re.VERBOSE)
In [5]: pattern.findall(example_accepted)
Out[5]: ['This is a try!', 'And this is the second line!']
In [6]: pattern.findall(example_not_accepted)
Out[6]: ['and this the second line']
The key part here is the flag re.MULTILINE. With this flag enabled, ^ and $ do not only match at the beginning and end of a string, but also at the beginning and end of lines which are separated by newlines. I'm sure there is something equivalent for Java as well.
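For what it's worth, the Java counterpart of re.MULTILINE is Pattern.MULTILINE, so a rough Java version of the session above might look like this. The class name LineFilterDemo and the method acceptedLines are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineFilterDemo {
    // Same pattern as the Python example; MULTILINE makes ^ and $ work per line.
    private static final Pattern LINE = Pattern.compile("^\\S.+$", Pattern.MULTILINE);

    // Hypothetical helper: returns the lines that do not start with whitespace.
    static List<String> acceptedLines(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(acceptedLines("This is a try!\nAnd this is the second line!"));
        System.out.println(acceptedLines(" This is a try with initial spaces\nand this the second line"));
    }
}
```

As in the Python session, this filters line by line rather than validating the string as a whole, so the second input still yields its second line.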
