Regex format for a particular Match - java

I am trying to write a regex for the following format
PA-123456-067_TY
It's always PA, followed by a dash, 6 digits, another dash, then 3 digits, and ends with _TY
Apparently, when I write this regex to match the above format it shows the output correctly
^[^[PA]-]+-(([^-]+)-([^_]+))_([^.]+)
with all the Negation symbols ^
This does not work if I write the regex in the below format without negation symbols
[[PA]-]+-(([-]+)-([_]+))_([.]+)
Can someone explain to me why is this so?

The negation symbol means that the character cannot be anything within the specified class. Your regex is much more complicated than it needs to be and is therefore obfuscating what you really want.
You probably want something like this:
^PA-(\d+)-(\d+)_TY$
... which matches anything that starts with PA-, then includes two groups of numbers separated by a dash, then an underscore and the letters TY. If you want everything after the PA to be what you capture, but separated into the three groups, then it's a little more abstract:
^PA-(.+)-(.+)_(.+)$
This matches:
PA-
a capture group of any characters
a dash
another capture group of any characters
an underscore
all the remaining characters until end-of-line
Character classes [...] are saying match any single character in the list, so your first capture group (([^-]+)-([^_]+)) is looking for anything that isn't a dash any number of times followed by a dash (which is fine) followed by anything that isn't an underscore (again fine). Having the extra set of parentheses around that creates another capture group (probably group 1 as it's the first parentheses reached by the regex engine)... that part is OK but probably makes interpreting the answer less intuitive in this case.
In the re-write however, your first capture group (([-]+)-([_]+)) matches [-]+, which means "one or more dashes" followed by a dash, followed by any number of underscores followed by an underscore. Since your input does not have a dash immediately following PA-, the entire regex fails to find anything.
Putting the PA inside embedded character classes is also making things complicated. The first part of your first one is looking for, well, I'm not actually sure how [^[PA]-]+ is interpreted in practice but I suspect it's something like "not either a P or an A or a dash any number of times". The second one is looking for the opposite, I think. But you don't want any of that, you just want to start without anything other than the actual sequence of characters you care about, which is just PA-.
Update: As per the clarifications in the comments on the original question, knowing you want fixed-size groups of digits, it would look like this:
^PA-(\d{6})-(\d{3})_TY$
That captures PA-, then a 6-digit number, then a dash, then a 3-digit number, then _TY. The six digit number and 3 digit numbers will be in capture groups 1 and 2, respectively.
If the sizes of those numbers could ever change, then replace {x} with + to just capture numbers regardless of max length.

according to your comment this would be appropriate PA-\d{6}-\d{3}_TY
EDIT: if you want to match a line use it with anchors: ^PA-\d{6}-\d{3}_TY$

Related

How to combine these regex for javascript

Hi I am trying to use regEx in JS for identifying 3 identical consecutive characters (could be alphabets,numbers and also all non alpha numeric characters)
This identifies 3 identical consecutive alphabets and numbers : '(([0-9a-zA-Z])\1\1)'
This identifies 3 identical consecutive non alphanumerics : '(([^0-9a-zA-Z])\1\1)'
I am trying to combine both, like this : '(([0-9a-zA-Z])\1\1)|(([^0-9a-zA-Z])\1\1)'
But I am doing something wrong and its not working..(returns true for '88aa3BBdd99##')
Edit : And to find NO 3 identical characters, this seems to be wrong /(^([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1)/ --> RegEx in JS to find No 3 Identical consecutive characters
thanks
Nohsib
The problem is that backreferences are counted from left to right throughout the whole regex. So if you combine them your numbers change:
(([0-9a-zA-Z])\2\2)|(([^0-9a-zA-Z])\4\4)
You could also remove the outer parens:
([0-9a-zA-Z])\1\1|([^0-9a-zA-Z])\2\2
Or you could just capture the alternatives in one set of parens together and append one back-reference to the end:
([0-9a-zA-Z]|[^0-9a-zA-Z])\1\1
But since your character classes match all characters anyway you can have that like this as well:
([\s\S])\1\1
And if you activate the DOTALL or SINGLELINE option, you can use a . instead:
(.)\1\1
It's actually much simpler:
(.)\1\1
The (.) matches any character, and each \1 is a back reference that matches the exact string that was matched by the first capturing group. You should be aware of what the . actually matches and then modify the group (in the parentheses) to fit your exact needs.

Regular Expression of a Specific Word

I want to create a regular expression in java using standard libraries that will accommodate the following sentence:
12 of 128
Obviously the numbers can be anything though... From 1 digit to many
Also, I'm not sure how to accommodate the word "of" but I thought maybe something along the lines of:
[\d\sof\s\d]
This should work for you:
(\d+\s+of\s+\d+)
This will assume that you want to capture the full block of text as "one group", and there can be one-or-more whitespace characters in between each (if only one space, you can change \s+ to just \s).
If you want to capture the numbers separately, you can try:
(\d+)\s+of\s+(\d+)
You want this:
\d+\sof\s\d+
The relevant change from what you already had is the addition of the two plus signs. That means, that it should match multiple digits but at least one.
Sample: http://regexr.com?32cao
This regexp
"\\d+ of \\d+"
will match at least one to any number of digits, followed by string " of " followed by one to any number of digits.

In Java regex - how to retain numbers ONLY when attached to string

I'm trying to tokenize text files that contain useful text but also many numbers that I don't want. However, using something like [^a-zA-Z0-9], I retain all digits (0-9).
I would like to retain digits ONLY if attached to characters OR hypnenated like "24hr" or "7-days".
So, input: "There are 3, 24hr positions available 7-days a week. Call 555-1212"
Returns a list of the following tokens: There are 24hr positions available 7-days a week Call
Thanks for any help!
\d+-?[A-Za-z]+|[A-Za-z]+-?\d+|[A-Za-z]+
See it here in action: http://regexr.com?318em
The square brackets [, ] represent something called a character class, which basically means match anything in this class. [A-Za-z0-9] will match any combination of letters and digits.
If you want to specify order you need to remove the digits from the character class and add another character class after it.
ex:
[0-9]+-?[a-zA-Z]+|[a-zA-Z]+-?[0-9]+|[a-zA-Z]+
[a-zA-Z]+ - matches 1 or more letters
-? - optionally matches a dash
[0-9]+ - matches 1 or more digits
After lots of trial and error, this did it (note leading space):
\d[^-a-z]+ | -\d+|[^a-zA-Z0-9-]|[0-9]+-[0-9]+|\W-+|[0-9]+-\W
http://regexr.com?318hp
I hope this helps anyone else who needs it.
I'm using it in RapidMiner to remove unwanted tokens in text processing.

Regular Expression to match one or more digits 1-9, one '|', one or more '*" and zero or more ','

I'm new to regular expressions and I need to find a regular expression that matches one or more digits [1-9] only ONE '|' sign, one or more '*' sign and zero or more ',' sign.
The string should not contain any other characters.
This is what I have:
if(this.ruleString.matches("^[1-9|*,]*$"))
{
return true;
}
Is it correct?
Thanks,
Vinay
I think you should test separately for every type of symbols rather than write complex expression.
First, test that you don't have invalid symbols - "^[0-9|*,]$"
Then test for digits "[1-9]", it should match at least one.
Then test for "\\|", "\\*" and "\\," and check the number of matches.
If all test are passed then your string is valid.
Nope, try this:
"^[1-9]+\\|\\*+,*$"
Please give us at least 10 possible matching strings of what you are looking to accept, and 10 of what you want to reject, and tell us if either this have to keep some sequence or its order doesn't matter. So we can make a reliable regex.
By now, all I can offer is:
^[1-9]+\|{1}\*+,*$
This RegEx was tested against these sample strings, accepting them:
56421|*****,,,
2|*********,,,
1|*
7|*,
18|****
123456789|*
12|********,,
1516332|**,,,
111111|*
6|*****,,,,
And it was tested against these sample strings, rejecting them:
10|*,
2***525*|*****,,,
123456,15,22*66*****4|,,,*167
1|2*3,4,5,6*
,*|173,
|*,
||12211
12
1|,*
1233|54|***,,,,
I assume your given order is strict and all conditions apply at the same time.
It looks like the pattern you need is
n-n, one or more times seperated by commas
then a bar (|)
then n*n, one or more times seperated by commas.
Here is a regular expression for that.
([1-9]{1}[0-9]*\-[0-9]+){1}
(,[1-9]{1}[0-9]*\-[0-9]+)*
\|
([1-9]{1}[0-9]*\*[0-9]+){1}
(,[1-9]{1}[0-9]*\*[0-9]+)*
But it is so complex, and does not take into account the details, such as
for the case of n-m, you want
n less than m
(I guess).
And you likely want the same number of n-m before the bar, and x*y after the bar.
Depends whether you want to check the syntax completely or not.
(I hope you do want to.)
Since this is so complex, it should be done with a set of code instead of a single regular expression.
this regex should work
"^[1-9\\|\\*,-]*$"
Assert position at the beginning of the string «^»
Match a single character present in the list below «[1-9\|*,-]»
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «»
A character in the range between “1” and “9” «1-9»
A | character «\|»
A * character «*»
The character “,” «,»
The character “-” «-»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»

regex for specific digit prefix

I am trying to have the following regx rule, but couldn't find solution.
I am sorry if I didn't make it clear. I want for each rule different regx. I am using Java.
rule should fail for all digit inputs start with prefix '1900' or '1901'.
(190011 - fail, 190111 - fail, 41900 - success...)
rule should success for all digit inputs with the prefix '*'
different regex for each rule (I am not looking for the combination of both of them together)
Is this RE fitting the purpose ? :
'\A(\*|(?!190[01])).*'
\A means 'the beginning of string' . I think it's the same in Java's regexes
.
EDIT
\A : "from the very beginning of the string ....". In Python (which is what I know, in fact) this can be omitted if we use the function match() that always analyzes from the very beginning, instead of search() that search everywhere in a string. If you want the regex able to analyze lines from the very beginning of each line, this must be replaced by ^
(...|...) : ".... there must be one of the two following options : ....."
\* : "...the first option is one character only, a star; ..." . As a star is special character meaning 'zero, one or more times what is before' in regex's strings, it must be escaped to strictly mean 'a star' only.
(?!190[01]) : "... the second option isn't a pattern that must be found and possibly catched but a pattern that must be absent (still after the very beginning). ...". The two characters ?! are what says 'there must not be the following characters'. The pattern not to be found is 4 integer characters long, '1900' or '1901' .
(?!.......) is a negative lookahead assertion. All kinds of assertion begins with (? : the parenthese invalidates the habitual meaning of ? , that's why all assertions are always written with parentheses.
If \* have matched, one character have been consumed. On the contrary, if the assertion is verified, the corresponding 4 first characters of the string haven't been consumed: the regex motor has gone through the analysed string until the 4th character to verify them, and then it has come back to its initial position, that is to say, presently, at the very beginning of the string.
If you want the bi-optional part (...|...) not to be a capturing group, you will write ?: just after the first paren, then '\A(?:\*|(?!190[01])).*'
.* : After the beginning pattern (one star catched/matched, or an assertion verified) the regex motor goes and catch all the characters until the end of the line. If the string has newlines and you want the regex to catch all the characters until the end of the string, and not only of a line, you will specify that . must match the newlines too (in Python it is with re.MULTILINE), or you will replace .* with (.|\r|\n)*
I finally understand that you apparently want to catch strings composed of digits characters. If so the RE must be changed to '\A(?:\*|(?!190[01]))\d*' . This RE matches with empty strings. If you want no-match with empty strings, put \d+ in place of \d* . If you want that only strings with at least one digit, even after the star when it begins with a star, match, then do '\A(?:\*|(?!190[01]))(?=\d)\d*'
For the first rule, you should use a combo regex with two captures, one to capture the 1900/1901-prefixed case, and one the capture the rest. Then you can decide whether the string should succeed or fail by examining the two captures:
(190[01]\d+)|(\d+)
Or just a simple 190[01]\d+ and negate your logic.
Regex's are not really very good at excluding something.
You may exclude a prefix using negative look-behind, but it won't work in this case because the prefix is itself a stream of digits.
You seem to be trying to exclude 1-900/901 phone numbers in the US. If the number of digits is definite, you can use a negative look-behind to exclude this prefix while matching the remaining exact number digits.
For the second rule, simply:
\*\d+

Categories