I am trying to create a regex pattern to match abcdef, abc def, etc.
The patterns I tried are:
[a-z]{3}[\s?][a-z]{3}
[a-z]{3}[\s*][a-z]{3}
[\w]{3}[\s?][\w]{3}
\w{3}\s?\w{3}
All of these seem to work for abc def, but not for abcdef.
{EDIT}
AB CD 12 ABC/15 DEF
.*\bAB CD\b\s?(\d+)\s?\bABC\b[/](\d+)\s?\bDEF\b
I am trying to extract 12 and 15.
How about:
[\w]{3}\s?[\w]{3}
or any of the other combinations. Just remove the \s from the brackets or put the quantifier (i.e. * and ?) outside of the bracket for the space selector:
[\w]{3}[\s]?[\w]{3}
Your bottom one should work too.
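To see why the bracketed versions fail on abcdef, here is a small Python sketch (the test strings are only illustrative). Inside square brackets the ? is a literal character, so [\s?] demands exactly one character that is either whitespace or a question mark:

```python
import re

# Quantifier inside the brackets: one mandatory char (whitespace or a literal "?")
broken = re.compile(r"[a-z]{3}[\s?][a-z]{3}")
# Quantifier outside: the whitespace is optional
fixed = re.compile(r"[a-z]{3}\s?[a-z]{3}")

for text in ("abcdef", "abc def", "abc?def"):
    print(text, bool(broken.fullmatch(text)), bool(fixed.fullmatch(text)))
```

The broken pattern rejects abcdef and accepts abc?def, while the fixed one does the opposite.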
First one is close
[a-z]{3}\s?[a-z]{3}
Take the ? out of the square brackets.
Java will not read it correctly. In a Java string literal you will need to escape the \s as \\s.
Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.
I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.
Is this possible with regular expressions?
When your left- and right-hand delimiters are single characters, it can be easily solved with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use (demo)
a[^abc]*c
This is the same technique you use when you want to make sure there is a b in between the closest a and c (demo):
a[^abc]*b[^ac]*c
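A quick Python check of both single-character patterns, using made-up sample strings:

```python
import re

# Closest a...c pair with no a, b or c in between
no_b = re.compile(r"a[^abc]*c")
# Closest a...c pair that must contain a b in between
with_b = re.compile(r"a[^abc]*b[^ac]*c")

print(no_b.findall("a--c a-b-c aa--c"))
print(with_b.findall("a--c a-b-c a-a-b-c"))
```

In the last sample, a-a-b-c yields only the inner a-b-c, because the negated class refuses to step over the second a.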
When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:
abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz
See the regex demo
To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.
Note that to achieve better performance with such a heavy pattern, you should consider unrolling it. This can be done with negated character classes and negative lookaheads.
Pattern details:
abc - match abc
(?:(?!abc|xyz|123).)* - match any character that is not the starting point of an abc, xyz or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - any character that is not the starting point of an abc or xyz character sequence
xyz - a trailing substring xyz
(If re.S is used, . matches any character, including newlines.)
See the Python demo:
import re
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']
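As for the unrolling mentioned above, one possible form is sketched below. The character sets [ax1] and [ax] are my own derivation (the first characters of the forbidden sequences), so treat the pattern as an illustration and not as the canonical unrolled version. The idea is to consume runs of "safe" characters with a negated class and only run the lookahead at characters that could start a delimiter:

```python
import re

tempered = re.compile(
    r"abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz", re.DOTALL
)

# Unrolled: safe* (lookahead-checked-risky-char safe*)*
# Negated classes already match newlines, so re.DOTALL is not needed here.
unrolled = re.compile(
    r"abc"
    r"[^ax1]*(?:(?!abc|xyz|123)[ax1][^ax1]*)*"
    r"123"
    r"[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*"
    r"xyz"
)

sample = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(unrolled.findall(sample))
```

On this sample both patterns return the same three matches.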
Using PCRE a solution would be:
This uses the m flag. If you want to match only whole lines, add ^ and $ at the beginning and end of the pattern respectively.
abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz
Debuggex Demo
The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:
where val like 'abc%123%xyz' and
val not like 'abc%abc%' and
val not like '%xyz%xyz'
I imagine something quite similar is simple to do in other environments.
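For instance, the same three checks could be written with plain string operations in Python. This is only a sketch: is_valid is a hypothetical helper name, and it validates one candidate value instead of searching a big file:

```python
def is_valid(val: str) -> bool:
    """Mirror the three LIKE conditions with plain string checks."""
    return (
        val.startswith("abc")        # like 'abc%...'
        and val.endswith("xyz")      # like '...%xyz'
        and "123" in val[3:-3]       # 123 somewhere between the delimiters
        and "abc" not in val[3:]     # not like 'abc%abc%'
        and "xyz" not in val[:-3]    # not like '%xyz%xyz'
    )


for candidate in ("abc 123 xyz", "abc abc 123 xyz", "abc 123 xyz xyz"):
    print(candidate, is_valid(candidate))
```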
You could use lookaround.
/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g
(I've not tested it.)
I have the following string:
'pp_3', 365]
What comes after pp_ may have different length. What comes after , and is before ] is what I'd like to capture (and only it). Its length varies but it is always a number.
I've come up with (?<=pp_).*,(.*)(?=]). It gives 3', 365 as a full match and in group 1 there is what I want '365'. How do I get only 365 as a full match?
Please let me know if I am unable to explain my doubts. Thanks
Try this:
[^_]*_(\d*)'\s*,\s*(\-?\d+)\s*]
This regex captures 2 groups, that correspond to each of the numbers, the first after pp_ and the second after ', (which may be negative). If you don't want to capture the first one as a group, instead of (\d*), just write (?:\d*).
Try this expression. The second group should be what you're after:
(?<='pp_)(\d*', )(\d*)]
To match the digits only with a positive lookbehind, you can put a bounded quantifier inside the lookbehind (you choose the upper bound yourself). Java supports this:
(?<=pp_[^,]{0,1000}, )\d+(?=])
Explanation
(?<= Positive lookbehind, assert what is on the left is
pp_[^,]{0,1000} Match pp_, match any char except , 0-1000 times
, Match a comma and space
) Close lookbehind
\d+ Match 1+ digits
(?=]) Positive lookahead, assert what is on the right is ]
In Java
String regex = "(?<=pp_[^,]{0,1000}, )\\d+(?=])";
Java demo
You could also use a capturing group instead:
pp_[^,]*, (\d+)]
Regex demo
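The capturing-group version also ports to engines that do not accept a quantified lookbehind. Python's re module, for example, requires fixed-width lookbehinds, so there the group is the simpler route. A minimal sketch (the second sample string is made up):

```python
import re

pattern = re.compile(r"pp_[^,]*, (\d+)]")

for text in ("'pp_3', 365]", "'pp_12345', 7]"):
    match = pattern.search(text)
    # group(1) holds only the digits between ", " and "]"
    print(match.group(1) if match else None)
```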
I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.
The split method isn't handy in this kind of situation. A workaround is to describe everything that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
demo
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* # zero or more characters that aren't a `?` or a `'`
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
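The find-based approach is not Java specific. As a sketch, here it is in Python with re.findall on the string from the question; re.DOTALL is my addition so that an escaped newline would also be consumed by the dot:

```python
import re

# Match everything that is NOT a delimiter: "?" always swallows the next char
token = re.compile(r"[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+", re.DOTALL)

text = "ABC??'DEF?'GHI'JKL????'MNO'"
parts = token.findall(text)
print(parts)
```

The quote after DEF? stays inside its segment because the preceding ? consumes it as an escaped character.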
You can also use the split method, but the pattern for it isn't efficient since it needs to look backward at each position (and it is limited, since lookbehind in Java only allows bounded quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')
Have you tried a positive lookbehind?
(?<=.')
Regex101
This regex will do it:
[A-Z]+(\?\?)*'
If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Working example. Lookbehinds must be fixed-width, but some engines allow an OR of the entire lookbehind. Others do not, and would require it be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.
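As an illustration, the three-lookbehind form runs unchanged in Python, whose re module accepts lookbehinds as long as each one is fixed-width. This sketch assumes at most five consecutive question marks, as stated above. Unlike Java's split, Python's re.split keeps the trailing empty string, so it is filtered out here:

```python
import re

# Do not split on a quote preceded by exactly one, three or five "?"
splitter = re.compile(r"(?<![^?]\?)(?<![^?]\?{3})(?<![^?]\?{5})'")

text = "ABC??'DEF?'GHI'JKL????'MNO'"
parts = [part for part in splitter.split(text) if part]
print(parts)
```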
I want to match just the two words before a matched string,
e.g. Rohan pillai J.
Currently I am using:
pattern= (?=\w+ J[.])\w+
Answer - pillai
desired answer - Rohan pillai
An alternative to take the first two names:
\w*\s\w*(?=\sJ\.)
Regex live here.
Explaining:
\w*\s # the first word (name) followed by space
\w* # the second word (name)
(?=\sJ\.) # must end with space and "J." - without taking it
Tip: Generally, we escape regex metacharacters (like the dot .) with a backslash. Use a character class like [.] if you want to emphasize that character (to make it more visible when you read the regex later).
You need to put the lookahead at the end:
(\w+) (\w+)(?= J\.)
See demo https://regex101.com/r/wH0oU8/1
Or, more generally, you can use \s to match any whitespace instead of a space:
(\w+)\s(\w+)(?=\sJ\.)
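A short Python sketch of the difference between the leading and the trailing lookahead, using the sample name from the question:

```python
import re

text = "Rohan pillai J."

# Lookahead in front: only the word directly before " J." can match
leading = re.compile(r"(?=\w+ J[.])\w+")
# Lookahead at the end: both words are consumed, " J." is only asserted
trailing = re.compile(r"(\w+)\s(\w+)(?=\sJ\.)")

print(leading.findall(text))
match = trailing.search(text)
print(match.group(0), match.groups())
```

The full match of the second pattern is Rohan pillai, and the two groups give each name separately.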
For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).
[] are used to match any of the things you specify inside the brackets, and are to be followed by a quantifier.
The second regexp will work: ^\(\d+\), so check your code.
Check also so there's no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, Java can be tricky with escapes, depending on whether the regexp you type is used directly as a regexp or is first a string literal. You may need to double-escape your escapes.
In Java you have to escape the parentheses, so "\\(\\d+\\)" should match (1) in both case1 and case2. Adding ^ as you did, "^\\(\\d+\\)", will match only case1.
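The same behaviour can be checked quickly in Python, where a raw string (r"...") avoids the double escaping that a Java string literal needs. This is only a sketch to compare the patterns from the question:

```python
import re

case1 = "(1) Hello, how are you?"
case2 = "Hi. (1) How're you doing?"

anchored = re.compile(r"^\(\d+\)")     # must start with "(digits)"
anywhere = re.compile(r"\(\d+\)")      # "(digits)" at any position
char_class = re.compile(r"^[\(\d+\)]") # brackets: just ONE of ( digit + )

print(anchored.search(case1).group())    # whole "(1)" prefix
print(anchored.search(case2))            # None: case2 starts with "Hi."
print(anywhere.search(case2).group())    # "(1)" found mid-string
print(char_class.search(case1).group())  # only the single character "("
```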
You have to use double backslashes within a Java string. Consider this:
"\n" gives you [line break]
"\\n" gives you [backslash][n]
If you are going to downvote my post, at least comment to tell me WHY it's not useful.
I believe Java's Regex Engine supports Positive Lookbehind, in which case you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $