Weird behaviour on regexp Matcher - java

My regexp below is supposed to filter out capitalized words with a length of 8-10, where 0-2 digits may appear. It has been working for all of my tests, but for some reason it got stuck on the string below: n.group(0) contains only an empty string instead of the matched "word".
static final Pattern PATTERN =
    Pattern.compile("\\b(?=[A-Z\\d]{9,10}\\b)(?:[A-Z]*\\d){0,2}[A-Z]*\\b");
Matcher n = PATTERN.matcher("foo ID:636152727 bar");
while (n.find()) {
    String s = n.group(0);
    resultArrayList.add(s);
}
Why does my pattern match ID:636152727?
Some examples that I want to filter out (which is working):
AAAAAAAAAA
1AAAAAAAAA
1AAAAAAAA1
etc...

I don't have a better solution to offer than the one in Ωmega's answer, but I think I can explain what's happening. What it boils down to is that the first \b and the last \b are matching the same spot: right after the colon.
That's the first place where the lookahead can match, since it's followed by nine digits and a word boundary. Then the next part of the regex tries to match two digits (interspersed with any number of uppercase letters) followed by a word boundary, and fails. So it tries to match just one digit (ditto), and fails again. Then it tries matching zero digits (interspersed with zero letters), and it succeeds, without advancing the match position. That position is still a word boundary, so the final \b succeeds as well.
A word boundary is just another zero-width assertion, like lookaheads and lookbehinds. There's no reason why two or more can't be applied at the same spot; you did that on purpose with the first word boundary and the lookahead. Some regex flavors treat it as an error if you apply a quantifier to an assertion (like \b+), but I don't think any of them would catch this problem. This is one of those rare instances where separate start-of-word and end-of-word assertions, like GNU's \< and \> or Tcl's \m and \M, would make a difference.
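The zero-width match described above can be made visible by printing the start and end index of every match. This is only a minimal sketch: the class name EmptyMatchDemo and the helper describeMatches are made up for illustration, while the pattern and the input string are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // The pattern from the question, unchanged.
    static final Pattern PATTERN =
        Pattern.compile("\\b(?=[A-Z\\d]{9,10}\\b)(?:[A-Z]*\\d){0,2}[A-Z]*\\b");

    // Hypothetical helper: formats each match as "start-end:text"
    // so that zero-length matches show up as e.g. "7-7:".
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describeMatches("foo ID:636152727 bar"));
        System.out.println(describeMatches("foo 1AAAAAAAA1 bar"));
    }
}
```

For the problematic input, the only match should start and end at index 7, right after the colon, which is why group(0) comes back empty.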

You need to use the anchors ^ and $:
Pattern.compile("^(?=[A-Z\\d]{9,10}$)(?:[A-Z]*\\d){0,2}[A-Z]*$");
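As a rough check of the anchored pattern, the sketch below applies it to whole strings. The class name AnchoredWordDemo and the helper isValid are assumptions for illustration; the regex itself is the one shown above. Because ^ and $ pin the match to the whole input, this variant only works when each candidate word is tested on its own.

```java
import java.util.regex.Pattern;

class AnchoredWordDemo {
    // The anchored pattern from the answer above.
    static final Pattern PATTERN =
        Pattern.compile("^(?=[A-Z\\d]{9,10}$)(?:[A-Z]*\\d){0,2}[A-Z]*$");

    // Hypothetical helper: true if the whole candidate satisfies the pattern.
    static boolean isValid(String candidate) {
        return PATTERN.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"AAAAAAAAAA", "1AAAAAAAA1", "636152727", "ID:636152727"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```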
To find such words inside a longer string, delimited by whitespace, use this pattern:
"(?:^|(?<=\\s))(?=[A-Z\\d]{9,10}(?:\\s|$))(?:[A-Z]*\\d){0,2}[A-Z]*(?=\\s|$)"


Regex return true if even a substring follows the pattern

I was just practicing regex and found something intriguing.
For the string
"world9 a9$ b6$" my regular expression "^(?=.*[\\d])(?=\\S+\\$).{2,}$"
returns false, because the second lookahead runs into a space before it finds a $ sign preceded by non-space characters.
As a whole, the string doesn't match the pattern.
What should the regular expression be if I want it to return true even if only a substring follows the pattern?
In this string, a9$ and b6$ both follow the regular expression.
You can use
^(?=\D*\d)(?=.*\S\$).{2,}$
As The fourth bird mentions, since \S\$ matches two chars, you may simply move that pattern to the consuming part and use ^(?=\D*\d).*\S\$.*$ instead.
Details
^ - start of string (implicit if used in .matches())
(?=\D*\d) - a positive lookahead that requires zero or more non-digit chars followed with a digit char immediately to the right of the current location
(?=.*\S\$) - a positive lookahead that requires zero or more chars other than line break chars, as many as possible, followed with a non-whitespace char and a $ char immediately to the right of the current location
.{2,} - any two or more chars other than line break chars, as many as possible
$ - end of string (implicit if used in .matches())
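The difference between the two lookahead-based patterns can be checked with a small sketch like the one below. The class name LookaheadMatchDemo and the helper matchesWhole are hypothetical; the two regexes are the one from the question and the one suggested in this answer.

```java
import java.util.regex.Pattern;

class LookaheadMatchDemo {
    // Pattern from the question: \S+ cannot cross the space after "world9".
    static final String ORIGINAL = "^(?=.*[\\d])(?=\\S+\\$).{2,}$";
    // Pattern from the answer: .* may cross whitespace before \S\$.
    static final String FIXED = "^(?=\\D*\\d)(?=.*\\S\\$).{2,}$";

    // Hypothetical helper: full-string match, as String.matches() would do.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String input = "world9 a9$ b6$";
        System.out.println("original: " + matchesWhole(ORIGINAL, input));
        System.out.println("fixed:    " + matchesWhole(FIXED, input));
    }
}
```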
Mostly, knock out the ^ and $ bits, as those force this into a full string match, and you want substring matches. In general, look-ahead seems like a mistake here, what are you trying to accomplish by using that? (Look-ahead/look-behind is rarely needed in general). All you need is:
Pattern.compile("\\S+\\$");
Possibly, if you want an element (such as a9$) to stand on its own, require whitespace or the start/end of input around it. Note that \b (regexpese for word boundary) does not help after the $: it only matches between a word character (a letter, digit or underscore) and a non-word character or the start/end of input, and $ is not a word character, so \$\b fails when the $ is followed by a space or by the end of input. Thus:
Pattern.compile("(?<!\\S)\\S+\\$(?!\\S)")
still matches foo a9$ bar, or a9$ just fine.
If you MUST put this in terms of a full match, e.g. because matches() (which always does a full string match) is run and you can't change that, well, put ^.* in front and .*$ at the back of it, simple as that.
Absolutely nothing about this says "This can only be needed with lookahead".
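A minimal sketch of both options mentioned in this answer: find() with the bare \S+\$ pattern, and the same pattern wrapped in ^.* and .*$ for use with matches(). The class name SubstringFindDemo and the helpers findAll and containsToken are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstringFindDemo {
    // Substring pattern from the answer: non-space characters ending in '$'.
    static final Pattern TOKEN = Pattern.compile("\\S+\\$");

    // Hypothetical helper: collects every substring match via find().
    static List<String> findAll(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Hypothetical helper: the full-match variant, wrapped in ^.* and .*$.
    static boolean containsToken(String input) {
        return input.matches("^.*\\S+\\$.*$");
    }

    public static void main(String[] args) {
        System.out.println(findAll("world9 a9$ b6$"));
        System.out.println(containsToken("world9 a9$ b6$"));
    }
}
```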

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may contain a separator ("." or ",")!
For instance the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work!
I know that I have to use the method Pattern.matches(regex, String), but I am not able to find the correct regex.
Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
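To compare the two expressions from this answer, a sketch along these lines may help. The class name OperatorWhitelistDemo and the helper parse are hypothetical; the helper returns the left operand, the trimmed operator, and the right operand (groups 1, 3 and 4), or null when the input does not match as a whole.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OperatorWhitelistDemo {
    // First expression: anything that is not a digit or '-' in the middle.
    static final Pattern ANY_SEPARATOR =
        Pattern.compile("((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)");
    // Second expression: whitelisted operators, optional surrounding spaces.
    static final Pattern WHITELISTED =
        Pattern.compile("((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)");

    // Hypothetical helper: {left, operator, right} or null if no full match.
    static String[] parse(Pattern p, String equation) {
        Matcher m = p.matcher(equation);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(3).trim(), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse(ANY_SEPARATOR, "300 - -200")));
        System.out.println(Arrays.toString(parse(WHITELISTED, "300 - -200")));
        System.out.println(Arrays.toString(parse(WHITELISTED, "400,021+213.443")));
    }
}
```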
In your regex you have to escape the dot \. to match it literally and escape the \+ or else it would make the ? a possessive quantifier. To match 1+ digits you have to use a quantifier [0-9]+
For your example data, you could match 1+ digits followed by an optional part which matches either a dot or a comma at the start and at the end. If you want to match 1 time any character you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
That would match:
\d+(?:[.,]\d+)? 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. matches any single character (use .+ to repeat it 1+ times)
\d+(?:[.,]\d+)? the same as the first part
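One thing worth checking with this pattern is what the dot in the middle accepts. The sketch below compares the dot variant with a variant that uses the character class [-+*] suggested earlier in this answer. The class name NumberPairDemo and its constants are made up for illustration.

```java
class NumberPairDemo {
    // Pattern from the answer: any single character between the numbers.
    static final String ANY_CHAR = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
    // Variant with the dot replaced by a small operator class.
    static final String OPERATOR_ONLY = "\\d+(?:[.,]\\d+)?[-+*]\\d+(?:[.,]\\d+)?";

    public static void main(String[] args) {
        for (String s : new String[] {"500+400", "400,021+213.443", "500400"}) {
            // String.matches() always tests the whole string.
            System.out.println(s + " any-char: " + s.matches(ANY_CHAR)
                + ", operator-only: " + s.matches(OPERATOR_ONLY));
        }
    }
}
```

Since the dot also matches a digit, a plain number such as 500400 passes the first variant (the middle 0 is taken as the "operator"), which is one reason to prefer listing the allowed operators.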

Java/Regex - finding a characters anywhere in a String

I have a series of strings that I am searching for a particular combination of characters in. I am looking for a digit, followed by the letter m or M, followed by a digit, then followed by the letter f or F.
An example string is "Class (4) 1m5f Good", and the part I want to extract from it is 1m5f.
Here is the code I have, that doesn't work.
Pattern distancePattern = Pattern.compile("\\^[0-9]{1}[m|M]{1}[0-9]{1}[f|F]{1}$\\");
Matcher distanceMatcher = distancePattern.matcher(raceDetails.toString());
while (distanceMatcher.find()) {
String word= distanceMatcher.group(0);
System.out.println(word);
}
Can anyone suggest what I am doing wrong?
The ^ and $ characters at the start and end of your regex are anchors - they're limiting you to strings that only consist of the pattern you're looking for. The first step is to remove those.
You can then either use word boundaries (\b) to limit the pattern you're looking for to be an entire word, like this:
Pattern distancePattern = Pattern.compile("\\b\\d[mM]\\d[fF]\\b");
...or, if you don't mind your pattern appearing in the middle of a word, e.g., "Class (4) a1m5f Good", you can drop the word boundaries:
Pattern distancePattern = Pattern.compile("\\d[mM]\\d[fF]");
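A short sketch of how the two variants behave on the example strings. The class name DistanceDemo and the helper findAll are assumptions for illustration; the two patterns are the ones given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DistanceDemo {
    // With word boundaries: the distance must be a whole word.
    static final Pattern WHOLE_WORD = Pattern.compile("\\b\\d[mM]\\d[fF]\\b");
    // Without word boundaries: the distance may sit inside a longer word.
    static final Pattern ANYWHERE = Pattern.compile("\\d[mM]\\d[fF]");

    // Hypothetical helper: collects every match found by find().
    static List<String> findAll(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findAll(WHOLE_WORD, "Class (4) 1m5f Good"));
        System.out.println(findAll(WHOLE_WORD, "Class (4) a1m5f Good"));
        System.out.println(findAll(ANYWHERE, "Class (4) a1m5f Good"));
    }
}
```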
Quick notes:
You don't really need the {1}s everywhere - the default assumption is that a character or character class occurs once.
You can replace the [0-9] character class with \d (it means the same thing).
regular-expressions.info is a great resource for learning about regexes that I highly recommend you check out :)
I'd use word boundaries \b:
\b\d[mM]\d[fF]\b
For Java, backslashes are to be escaped:
\\b\\d[mM]\\d[fF]\\b
{1} is superfluous
[m|M] means m or | or M
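The point about the pipe inside a character class can be checked quickly. In the sketch below, the class name PipeInClassDemo and the helper found are hypothetical, and the input "1|5|" is a made-up string that contains pipes instead of letters.

```java
import java.util.regex.Pattern;

class PipeInClassDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Inside [...] the pipe is a literal character, not alternation.
        System.out.println(found("\\d[m|M]\\d[f|F]", "1|5|"));
        // Without the pipe in the class, the same input is rejected.
        System.out.println(found("\\d[mM]\\d[fF]", "1|5|"));
    }
}
```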
For the requirement of a digit, followed by the letter m or M, followed by a digit, then followed by the letter f or F, the regex can be simplified to:
Pattern distancePattern = Pattern.compile("(?i)\\dm\\df");
Where:
(?i) - For ignore case
\\d - For digits [0-9]
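As a small illustration, the inline (?i) flag can also be expressed with Pattern.CASE_INSENSITIVE at compile time. The class name IgnoreCaseDemo and the helper firstMatch are assumptions for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IgnoreCaseDemo {
    // Inline flag, as in the answer above.
    static final Pattern INLINE_FLAG = Pattern.compile("(?i)\\dm\\df");
    // The same pattern with the flag passed to compile() instead.
    static final Pattern COMPILE_FLAG =
        Pattern.compile("\\dm\\df", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: first match, or null if there is none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(INLINE_FLAG, "Class (4) 1M5f Good"));
        System.out.println(firstMatch(COMPILE_FLAG, "Class (4) 1m5F Good"));
    }
}
```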

regular expression to allow only 1 dash

I have a textbox where I get the last name of a user. How do I allow only one dash (-) in a regular expression? And it's not supposed to be in the beginning or at the end of the string.
I have this code:
Pattern p = Pattern.compile("[^a-z-']", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(name);
Try to rephrase the question in more regexy terms. Rather than "allow only one dash, and it can't be at the beginning" you could say, "the string's beginning, followed by at least one non-dash, followed by one dash, followed by at least one non-dash, followed by the string's end."
the string's beginning: ^
at least one non-dash: [^-]+
followed by one dash: -
followed by at least one non-dash: [^-]+
followed by the string's end: $
Put those all together, and there you go. If you're using this in a context that matches against the complete string (not just any substring within it), you don't need the anchors -- though it may be good to put them in anyway, in case you later use that regex in a substring-matching context and forget to add them back in.
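Putting the pieces listed above together gives ^[^-]+-[^-]+$. A minimal sketch, with a hypothetical class name SingleDashDemo and helper isValid:

```java
import java.util.regex.Pattern;

class SingleDashDemo {
    // Start, one or more non-dashes, one dash, one or more non-dashes, end.
    static final Pattern ONE_DASH = Pattern.compile("^[^-]+-[^-]+$");

    // Hypothetical helper: true if the name has exactly one interior dash.
    static boolean isValid(String name) {
        return ONE_DASH.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"last-name", "-lastname", "lastname-", "la-st-name", "lastname"}) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Note that this expression requires the dash, so a name without any dash is rejected too. If the dash should be optional, the expression would need to be adjusted.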
Why not just use indexOf() in String?
String s = "last-name";
int first = s.indexOf('-');
int last = s.lastIndexOf('-');
if(first == 0 || last == s.length()-1) // Checks if a dash is at the beginning or end
System.out.println("BAD");
if(first != last) // Checks if there is more than one dash
System.out.println("BAD");
Any speed difference compared to regex should not be noticeable in the least with strings as short as last names. Also, it will make debugging and future maintenance MUCH easier.
It looks like your regex represents a fragment of an invalid value, and you're presumably using Matcher.find() to find if any part of your value matches that regex. Is that correct? If so, you can change your pattern to:
Pattern p = Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");
which will match a non-letter-non-hyphen-non-apostrophe character, or a sequence of characters that both starts and ends with hyphens (thereby detecting a value that contains two hyphens), or a leading hyphen, or a trailing hyphen.
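Because this pattern describes what is wrong with a value, a name is acceptable when find() reports no match. A minimal sketch, assuming a hypothetical class name InvalidFragmentDemo and helper isInvalid:

```java
import java.util.regex.Pattern;

class InvalidFragmentDemo {
    // The pattern from the answer: any of these fragments marks the name invalid.
    static final Pattern INVALID = Pattern.compile("[^a-zA-Z'-]|-.*-|^-|-$");

    // Hypothetical helper: true if any invalid fragment occurs in the name.
    static boolean isInvalid(String name) {
        return INVALID.matcher(name).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Smith-Jones", "O'Neil", "Smith--Jones", "-Smith", "Smith-", "Sm1th"}) {
            System.out.println(s + " invalid? " + isInvalid(s));
        }
    }
}
```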
This regex represents one or more non-hyphens, followed by a single hyphen, followed by one or more non-hyphens.
^[^\-]+\-[^\-]+$
I'm not sure if the hyphen in the middle needs to be escaped with a backslash... That probably depends on what platform you're using for regex.
Try a pattern something like [a-z]-[a-z].
Pattern p = Pattern.compile("[a-z]-[a-z]");

Java regex mix two patterns

How can I get this pattern to work:
Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");
Basically, this will split my sentence on any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:
(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])
pattern explained here: Java regex patterns
which matches hyphenated words like "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?
Unfortunately it seems like you can't merge both expressions, at least as far as I know.
However, maybe you can reformulate your problem.
If, for example, you want to split between words (which can contain hyphens), try this expression:
(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)
This should match any sequence of non-letter characters that are not a minus, or any minus that is followed by a non-letter character or that is the first or last character in the input.
Using this expression for a split should result in this:
input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
The regex might need some additional optimization but it is a start.
Edit: This expression should get rid of the empty string in the split:
(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)
The first part would now read as "any non-letter character which is not a minus, followed by any number of non-letter characters" and should match .-- as well.
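The two split expressions can be compared side by side with String.split. This is only a sketch: the class name HyphenSplitDemo, its constants and the helper split are made up for illustration, while the regexes and the input are the ones quoted in this answer.

```java
import java.util.Arrays;
import java.util.List;

class HyphenSplitDemo {
    // First expression from the answer.
    static final String FIRST = "(?>[^\\p{L}-]+|-[^\\p{L}]+|^-|-$)";
    // Edited expression that swallows dashes following another delimiter.
    static final String SECOND = "(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)";
    static final String INPUT = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";

    // Hypothetical helper: splits the input and returns the pieces as a list.
    static List<String> split(String regex, String input) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(split(FIRST, INPUT));
        System.out.println(split(SECOND, INPUT));
    }
}
```

With the first expression, the comma and the following double dash before foo are two separate delimiters, which leaves an empty string between them. The edited expression treats them as one delimiter.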
Edit: in case you want to match words that could potentially contain hyphens, try this expression:
(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)
This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letter ((?:-\p{L}+)*). That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus or the end of the input ((?>(?=[^-\p{L}])|$))".
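Instead of splitting, the word-matching expression can be used with Matcher.find(). A minimal sketch, with a hypothetical class name HyphenWordDemo and helper findWords:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWordDemo {
    // The word-matching expression from the answer above.
    static final Pattern WORD = Pattern.compile(
        "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    // Hypothetical helper: collects every word the expression finds.
    static List<String> findWords(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(findWords("foo bar-baz"));
        System.out.println(findWords("aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"));
    }
}
```

Note that, unlike the split expressions, this one skips words that touch a stray minus entirely, so no, match and foo from the second input are not returned.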
