Java regex to exclude specific strings from a larger one

I have been banging my head against this for some time now:
I want to capture all [a-z]+[0-9]? character sequences excluding strings such as sin|cos|tan etc.
So having done my regex homework the following regex should work:
(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?
As you see I am using negative lookahead along with alternation - the \b after the non-capturing group closing parenthesis is critical to avoid matching the in of sin etc. The regex makes sense and as a matter of fact I have tried it with RegexBuddy and Java as the target implementation and get the wanted result but it doesn't work using Java Matcher and Pattern objects!
Any thoughts?
cheers

The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which it can't be if the next character is a-z.
Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.
I suggest:
\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b
Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
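Both suggestions above can be sketched side by side in Java. This is only an illustration: the class name, method names, and sample input are made up for the example, and the keyword set is assumed to be just sin, cos, and tan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierScan {
    // Backslashes are doubled: inside a Java string literal, "\b" alone is a backspace.
    private static final Pattern WITH_LOOKAHEAD =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");
    private static final Pattern PLAIN = Pattern.compile("\\b[a-z]+[0-9]?\\b");
    // Hypothetical keyword list for the example.
    private static final Set<String> KEYWORDS =
            new HashSet<>(Arrays.asList("sin", "cos", "tan"));

    // Approach 1: let the regex itself reject the keywords.
    static List<String> viaLookahead(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WITH_LOOKAHEAD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Approach 2: match every candidate, then drop keywords in plain Java.
    static List<String> viaFilter(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PLAIN.matcher(input);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "sin x1 cos cost tan y";
        System.out.println(viaLookahead(input));
        System.out.println(viaFilter(input));
    }
}
```

For this input both approaches keep cost, since only the exact words sin, cos, and tan are excluded.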

So you want [a-z]+[0-9]? (a sequence of at least one letter, optionally followed by a digit), unless that letter sequence resembles one of sin cos tan?
\b(?!(sin|cos|tan)(?=\d|\b))[a-z]+\d?\b
results:
cos - no match
cosy - full match
cos1 - no match
cosy1 - full match
bla9 - full match
bla99 - no match
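The result list above can be reproduced with a small Java check. This is a sketch that assumes "full match" means Matcher.matches() on each sample string; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class TrigNameCheck {
    // Same regex as above, with backslashes doubled for the Java string literal.
    private static final Pattern NAME =
            Pattern.compile("\\b(?!(sin|cos|tan)(?=\\d|\\b))[a-z]+\\d?\\b");

    // True only if the whole input is one acceptable name.
    static boolean isFullMatch(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"cos", "cosy", "cos1", "cosy1", "bla9", "bla99"};
        for (String s : samples) {
            System.out.println(s + " - " + (isFullMatch(s) ? "full match" : "no match"));
        }
    }
}
```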

I forgot to escape the \b for Java, so \b should be \\b, and it now works.
cheers

Related

Regex return true if even a substring follows the pattern

I was just practicing regex and found something intriguing
for a string
"world9 a9$ b6$" my regular expression "^(?=.*[\\d])(?=\\S+\\$).{2,}$"
will return false, because there is a space before the lookahead reaches a $ sign preceded by at least one digit and a non-space character.
As a whole, the string doesn't match the pattern.
What should the regular expression be if I want it to return true even when only a substring follows the pattern?
Here, a9$ and b6$ both follow the regular expression.
You can use
^(?=\D*\d)(?=.*\S\$).{2,}$
As The fourth bird mentions, since \S\$ matches two chars, you may simply move the pattern to the consuming part and use ^(?=\D*\d).*\S\$.*$ instead.
Details
^ - start of string (implicit if used in .matches())
(?=\D*\d) - a positive lookahead that requires zero or more non-digit chars followed with a digit char immediately to the right of the current location
(?=.*\S\$) - a positive lookahead that requires zero or more chars other than line break chars, as many as possible, followed with a non-whitespace char and a $ char immediately to the right of the current location
.{2,} - any two or more chars other than line break chars, as many as possible
$ - end of string (implicit if used in .matches())
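A short Java sketch comparing the pattern from the question with the two patterns from this answer. The class and helper names are made up for the example.

```java
import java.util.regex.Pattern;

public class DigitAndDollar {
    // Pattern from the question: (?=\S+\$) is only tried at the start of the string.
    static final Pattern ORIGINAL = Pattern.compile("^(?=.*[\\d])(?=\\S+\\$).{2,}$");
    // Suggested fix: the leading .* lets the "\S\$" pair sit anywhere in the string.
    static final Pattern FIXED = Pattern.compile("^(?=\\D*\\d)(?=.*\\S\\$).{2,}$");
    // Variant with the "\S\$" check moved into the consuming part.
    static final Pattern CONSUMING = Pattern.compile("^(?=\\D*\\d).*\\S\\$.*$");

    static boolean test(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String input = "world9 a9$ b6$";
        System.out.println("original:  " + test(ORIGINAL, input));
        System.out.println("fixed:     " + test(FIXED, input));
        System.out.println("consuming: " + test(CONSUMING, input));
    }
}
```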
Mostly, knock out the ^ and $ bits, as those force this into a full string match, and you want substring matches. In general, look-ahead seems like a mistake here, what are you trying to accomplish by using that? (Look-ahead/look-behind is rarely needed in general). All you need is:
Pattern.compile("\\S+\\$");
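possibly, if you want an element (such as a9$) to stand on its own, add \b, which is regexpese for word boundary: a position between a word character ([a-zA-Z0-9_]) and a non-word character such as whitespace or punctuation. \b also matches at the start or end of input when a word character sits there. Note that it only helps at the front here: $ is not a word character, so a \b placed after \$ would require a word character to follow, and would reject a9$ followed by a space or by the end of input. Thus:
Pattern.compile("\\b\\S+\\$")
still matches the a9$ in foo a9$ bar, or a9$ on its own, just fine.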
If you MUST put this in terms of a full match, e.g. because matches() (which always does a full string match) is run and you can't change that, well, put ^.* in front and .*$ at the back of it, simple as that.
Absolutely nothing about this says "This can only be needed with lookahead".
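The difference between substring matching with find() and full-string matching with matches() can be sketched like this. The class and method names are made up for the example, and the sample string is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubstringVsFullMatch {
    // Unanchored pattern, meant to be used with find().
    static final Pattern TOKEN = Pattern.compile("\\S+\\$");
    // Same pattern wrapped in .* on both sides, for callers stuck with matches().
    static final Pattern WHOLE = Pattern.compile(".*\\S+\\$.*");

    // Collects every substring that matches TOKEN.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Full-string check that succeeds if some substring matches TOKEN.
    static boolean containsToken(String input) {
        return WHOLE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "world9 a9$ b6$";
        System.out.println(tokens(input));
        System.out.println(TOKEN.matcher(input).matches());
        System.out.println(containsToken(input));
    }
}
```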

Exclude a letter in Regex Pattern

I am trying to create a Regex pattern for <String>-<String>. This is my current pattern:
(\w+\-\w+).
The first String is not allowed to be "W". However, it can still contain "W"s if it's more than one letter long.
For example:
W-80 -> invalid
W42-80 -> valid
How can this be achieved?
So your first string can be either one character other than W, or 2+ characters. A simple pattern to achieve that is:
([^W]|\w{2,})-\w+
But this pattern is not entirely correct, because it now allows any character for the first part, whereas originally only \w characters were allowed. So the correct pattern is:
([\w&&[^W]]|\w{2,})-\w+
The pattern [\w&&[^W]] means any character from the \w character class except W.
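A quick Java check of the two patterns above, assuming the whole input is tested with matches(). The class name, constant names, and the "#-80" sample are made up for the example.

```java
import java.util.regex.Pattern;

public class FirstTermNotW {
    // First attempt: [^W] also lets non-word characters such as '#' through.
    static final Pattern LOOSE = Pattern.compile("([^W]|\\w{2,})-\\w+");
    // Intersection: a single word character that is not W, or 2+ word characters.
    static final Pattern STRICT = Pattern.compile("([\\w&&[^W]]|\\w{2,})-\\w+");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"W-80", "W42-80", "A-80", "#-80"}) {
            System.out.println(s + " loose=" + matches(LOOSE, s)
                    + " strict=" + matches(STRICT, s));
        }
    }
}
```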
Just restrict the last char to "any word char except 'W'".
There are a couple of ways to do this:
Negative look-behind (easy to read):
^\w+(?<!W)-\w+$
Negated intersection (trainwreck to read):
^\w*[\w&&[^W]]-\w+$
——
The question has shifted. Here’s a new take:
^.+(?<!^W)-\w+
This allows anything as the first term except just "W".
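The two lookbehind patterns from this answer behave differently once the first term ends in W. A small Java sketch, assuming matches() on the whole input; the class name, constant names, and the "AW-80" sample are made up for the example.

```java
import java.util.regex.Pattern;

public class LookbehindVariants {
    // Earlier reading of the question: the character before '-' must not be W.
    static final Pattern LAST_CHAR_NOT_W = Pattern.compile("^\\w+(?<!W)-\\w+$");
    // Later reading: the first term must not be exactly "W".
    static final Pattern NOT_JUST_W = Pattern.compile("^.+(?<!^W)-\\w+");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"W-80", "W42-80", "AW-80"}) {
            System.out.println(s + " lastCharNotW=" + matches(LAST_CHAR_NOT_W, s)
                    + " notJustW=" + matches(NOT_JUST_W, s));
        }
    }
}
```

Only the second pattern accepts AW-80, because its lookbehind fails only when the W sits directly at the start of the input.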

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may have a comma ("." or ",")!
For instance the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work!
I know that I have to use the method Pattern.matches(regex, String), but I am not able to find the correct regex.
Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
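As a rough illustration of the two regexes above, here is a Java sketch that splits an input into left number, operator, and right number. The class name, the parse helper, and the choice to trim whitespace around the operator are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleEquation {
    // First version: anything that is not a digit or '-' may separate the numbers.
    static final Pattern ANY_SEPARATOR =
            Pattern.compile("((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)");
    // Second version: only whitelisted operators, with optional surrounding whitespace.
    static final Pattern WHITELISTED =
            Pattern.compile("((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)");

    // Returns {left, operator, right}, or null if the input does not match.
    static String[] parse(String input) {
        Matcher m = WHITELISTED.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Group 1 is the left number, group 3 the operator part, group 4 the right number.
        return new String[] {m.group(1), m.group(3).trim(), m.group(4)};
    }

    public static void main(String[] args) {
        for (String s : new String[] {"500+400", "400,021+213.443", "300 - -200"}) {
            System.out.println(s + " -> " + String.join(" | ", parse(s)));
        }
    }
}
```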
In your regex you have to escape the dot as \. to match it literally, and escape the plus as \+, or else it makes the preceding ? a possessive quantifier. To match 1+ digits you have to use a quantifier: [0-9]+.
For your example data, you could match 1+ digits followed by an optional part consisting of a dot or a comma and then 1+ more digits. If you want to match any character exactly once, you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
That would match:
\d+(?:[.,]\d+)? - 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. - any single character (use .+ to repeat it 1+ times)
\d+(?:[.,]\d+)? - same as the first part
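A small Java sketch of this pattern, next to a variant that uses the [-+*] operator class mentioned above in place of the dot. The class name, constant names, and the "5004" sample are made up for the example.

```java
import java.util.regex.Pattern;

public class TwoNumbers {
    // The dot in the middle accepts any single character, including a digit.
    static final Pattern ANY_CHAR =
            Pattern.compile("\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?");
    // Variant restricted to a few operators.
    static final Pattern OPERATOR =
            Pattern.compile("\\d+(?:[.,]\\d+)?[-+*]\\d+(?:[.,]\\d+)?");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"500+400", "400,021+213.443", "500+", "5004"}) {
            System.out.println(s + " anyChar=" + matches(ANY_CHAR, s)
                    + " operator=" + matches(OPERATOR, s));
        }
    }
}
```

The 5004 case shows why a character class may be the safer choice: with the plain dot, a digit can stand in for the separator.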

Word that matches ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$

I am totally confused right now.
What is a word that matches: ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
I tried at Regex 101 this 1Test#!. However that does not work.
I really appreciate your input!
Your regex appears to be written as a Java string literal (note the \\d).
That is why you have to convert it before testing it on regex101, which does not offer a Java flavor (only PHP, Python, and JavaScript).
See the converted regex:
^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
which will match your string 1Test#!. Demo here: http://regex101.com/r/gE3iQ9
You just want something that matches that regex?
Here:
a1a!
This pattern matches
\dTest#!
If you want a pattern that matches 1Test#!, try this pattern:
^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
Your Java string ^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$ encodes the regular expression ^.*(?=.*\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$.
This is because \ is the escape character in Java string literals, so \\ stands for a single backslash.
The latter matches the string you specified.
If your original string was a regexp, rather than a Java string, it would match strings such as \dTest#!
Also, you should consider removing the first .*; doing so would make the regexp more efficient. The reason is that quantifiers are greedy by default, so the initial .* starts by matching the whole string, and the lookaheads then fail. The regexp backtracks, matching the first .* to all but the last character, and at least one of the lookaheads fails again. This continues until it reaches a point where all the lookaheads succeed. Dropping the first .* and putting the lookaheads immediately after the start-of-string anchor avoids this problem, and in this case the set of strings matched is the same.
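The escaping point and the suggestion to drop the leading .* can both be checked with a short Java sketch. The class and constant names are made up for the example; the third pattern uses four backslashes in the source so that the regex looks for a literal backslash followed by d.

```java
import java.util.regex.Pattern;

public class LookaheadEscaping {
    // Java source "\\d" yields the regex token \d (any digit).
    static final Pattern LEADING_DOT_STAR =
            Pattern.compile("^.*(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$");
    // Same checks with the leading .* dropped, so the lookaheads run at the start.
    static final Pattern ANCHORED =
            Pattern.compile("^(?=.*\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$");
    // Java source "\\\\d" yields the regex \\d: a literal backslash, then 'd'.
    static final Pattern LITERAL_BACKSLASH_D =
            Pattern.compile("^.*(?=.*\\\\d)(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(LEADING_DOT_STAR, "1Test#!"));
        System.out.println(matches(ANCHORED, "1Test#!"));
        System.out.println(matches(LITERAL_BACKSLASH_D, "\\dTest#!"));
        System.out.println(matches(LITERAL_BACKSLASH_D, "1Test#!"));
    }
}
```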

Why does my regex containing \d{1,} together with a negative lookahead still match, where it shouldn't?

I'm trying to match a coordinate pair in a String using a Regex in Java. I explicitly want to exclude strings using negative lookahead.
to be matched:
558,228
558,228,
558,228,589
558,228,A,B,C
NOT to be matched:
558,228,<Text>
The Regex ^558,228(?!,<).* does the job, while ^\d{1,},\d{1,}(?!,<).* doesn't. It's the same regex with the metacharacter \d instead of values. Any ideas why?
The reason is the .* part at the end. It matches everything that wasn't matched earlier.
In combination with \d{1,}, which can backtrack and settle for fewer than three digits, it goes like this:
^\d{1,},\d{1,}(?!,<) will match 558,22 and .* will match the remaining part 8,<Text>.
The problem is the \d{1,} part in combination with the .* at the end.
In your case
558,228,<Text>
The ^\d{1,},\d{1,}(?!,<) matches "558,22" and the .* matches the rest "8,<Text>"
You can solve this using the possessive quantifier ++
^\d+,\d++(?!,<)(.*)
\d++ uses the seldom-used possessive quantifier, which is useful here. ++ means: match at least once, as many times as you can, and do not backtrack. That means it will not give back the digits once it has matched them.
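The backtracking described above can be made visible in Java. In this sketch the greedy pattern gets an extra capturing group around the number part, which is not in the original regex and is added only to show where the split happens; the class and method names are also made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PossessiveDigits {
    // Greedy version; group 1 is added here for illustration only.
    static final Pattern GREEDY = Pattern.compile("^(\\d{1,},\\d{1,})(?!,<)(.*)");
    // Possessive version from the answer: \d++ never gives digits back.
    static final Pattern POSSESSIVE = Pattern.compile("^\\d+,\\d++(?!,<)(.*)");

    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // Returns the two captured parts of the greedy match, or null if there is none.
    static String[] greedyParts(String s) {
        Matcher m = GREEDY.matcher(s);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String bad = "558,228,<Text>";
        String[] parts = greedyParts(bad);
        System.out.println(parts[0] + " | " + parts[1]);
        System.out.println(accepts(POSSESSIVE, bad));
        for (String ok : new String[] {"558,228", "558,228,", "558,228,589", "558,228,A,B,C"}) {
            System.out.println(ok + " -> " + accepts(POSSESSIVE, ok));
        }
    }
}
```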
