Regex to match first/last names with optional titles - java

I created the following regex (Java):
(Lord |Lady |Ser )?(Agatha|John)?([ ]??Cain)?
It's working fine except in one situation (and maybe others I didn't take into account during my tests):
When only the family name is present, the regex also captures the whitespace in front of the word. I understand why, but I don't know how to fix it.
This regex is used to find people in a large text file containing the content of a book. And, of course, it must be compatible with my current working environment (Java).

You can use regex lookbehind assertions to accomplish your goal.
\b(?<!\S)(?:(Lord|Lady|Ser)\s+)?(Agatha|John)?(?:\s*(?<=\b)(Cain))?(?<=\S)\b # regex101
It has these qualities which seem to match (possibly exceed) your criteria:
The regex match is forced to start with a non-whitespace character.
The first capture will be the title (or empty).
The second capture will be the first name (or empty).
The third capture will be the last name (or empty).
All matches have no leading or trailing whitespace.
Additionally, it will even match through line wraps (shown in additional text in the linked regex test sample).
Title, first, and last names are in singleton groups making additions to the match sets as simple as adding an additional alternation to their respective groups.
A trailing lookbehind insisting on the match ending with a non-whitespace was also added to avoid matching just "Lord " of an otherwise non-matching "Lord X".
A regex101 fiddle with example data is linked to the regex.
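As a rough illustration of how the pattern above might be used from Java, here is a minimal sketch. The class name NameFinder, the method findNames and the sample sentence are made up for this example; only the pattern itself comes from the answer, with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the pattern is taken from the answer.
class NameFinder {
    // Groups: 1 = title, 2 = first name, 3 = last name (each may be null).
    static final Pattern NAME = Pattern.compile(
        "\\b(?<!\\S)(?:(Lord|Lady|Ser)\\s+)?(Agatha|John)?(?:\\s*(?<=\\b)(Cain))?(?<=\\S)\\b");

    // Collects every full match in order of appearance.
    static List<String> findNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        // Sample sentence invented for the example.
        System.out.println(findNames("Lord John Cain spoke to Cain and Lady Agatha."));
    }
}
```

The two lookbehinds rule out an empty match, so the find() loop always advances. In Java, an optional group that did not take part in the match is returned as null by group(n) rather than as an empty string.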

Related

Regex limit repeated class sub character

I have an email address filtering regex used in Java. It works for the most part, except when trying to limit repeated dots in the username section of the email address.
The regex I'm using (with escaping removed) is [a-zA-Z0-9\.\_\-]+#[a-zA-Z0-9]+\.[a-zA-Z]{2,5}(\.[a-zA-Z]{2,5}){0,1}
This doesn't catch a bad email address like test..test#test.com. I've tried applying limiters to the class [a-zA-Z0-9\.\_\-] but that causes it to fail on valid email addresses.
Any thoughts would be greatly appreciated.
Add a negative lookahead for two dots anchored to start:
^(?!.*\.\.)[a-zA-Z0-9._-]+#[a-zA-Z0-9]+\.[a-zA-Z]{2,5}(\.[a-zA-Z]{2,5}){0,1}
This expression (?!.*\.\.) means the following text does not contain 2 consecutive dots.
By the way, you don't need to escape most characters when they are within a character class, including the characters ._-, i.e. [a-zA-Z0-9\.\_\-] is the same as [a-zA-Z0-9._-] (with the caveat that a dash is a literal dash when it appears first or last).
Using lookaheads makes adding overall constraints easy and you can easily add more, for example, to require that the overall length is at least 10 chars add (?=.{10}) to the front:
^(?=.{10})(?!.*\.\.)[a-zA-Z0-9._-]+#[a-zA-Z0-9]+\.[a-zA-Z]{2,5}(\.[a-zA-Z]{2,5}){0,1}
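To see how the lookahead constraints behave in Java, here is a small sketch. The class and method names (AddressFilter, isValid, isValidAndLong) are invented for the example, and the # separator is kept exactly as it appears in the question.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the two expressions from the answer.
class AddressFilter {
    // Rejects any input that contains two consecutive dots anywhere.
    static final Pattern ADDRESS = Pattern.compile(
        "^(?!.*\\.\\.)[a-zA-Z0-9._-]+#[a-zA-Z0-9]+\\.[a-zA-Z]{2,5}(\\.[a-zA-Z]{2,5}){0,1}");

    // Same expression with an extra lookahead requiring at least 10 characters.
    static final Pattern LONG_ADDRESS = Pattern.compile(
        "^(?=.{10})(?!.*\\.\\.)[a-zA-Z0-9._-]+#[a-zA-Z0-9]+\\.[a-zA-Z]{2,5}(\\.[a-zA-Z]{2,5}){0,1}");

    static boolean isValid(String s) {
        // matches() requires the whole input to match, not just a prefix.
        return ADDRESS.matcher(s).matches();
    }

    static boolean isValidAndLong(String s) {
        return LONG_ADDRESS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("test.test#test.com"));   // single dots only
        System.out.println(isValid("test..test#test.com"));  // two consecutive dots
        System.out.println(isValidAndLong("a.b#c.io"));      // shorter than 10 characters
    }
}
```

Because each lookahead is zero-width and anchored at the start, further constraints can be stacked without touching the part of the expression that consumes the address.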

Regular expression to return results that do not match selection

I work on a product that provides a Java API to extend it.
The API provides a function which takes a Perl regular expression and returns a list of matching files.
I want to filter the list to remove all files that end in .xml, .xsl and .cfg; basically the opposite of .*(\.xml|\.xsl|\.cfg).
I have been searching but I haven't been able to get anything to work yet.
I tried .*(?!\.cfg) and ^((?!cfg).)*$ and \.(?!cfg$|?!xml$|?!xsl$).
I don't know if I am on the right track or not.
Note
I know the regex systems are similar, but I can't get a Java regex working either.
You may use
^(?!.*\.(x[ms]l|cfg)$).+
See the regex demo
Details:
^ - start of a string
(?!.*\.(x[ms]l|cfg)$) - a negative lookahead that fails the match if any 0+ chars other than line break chars (.*) are followed by a dot (\.) and then xml, xsl or cfg ((x[ms]l|cfg)) at the end of the string ($)
.+ - any 1 or more chars other than linebreak chars. Might be omitted if the entire string match is not required (in some tools it is required though).
You need something like this, which matches only if the end of the string isn't preceded by a dot and one of the three unwanted extensions:
/(?<!\.(?:xml|xsl|cfg))\z/
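Both suggestions can be tried side by side in Java. The sketch below is only an illustration: the class ExtensionFilter, its methods and the sample file names are assumptions, and the real API in the question takes the pattern as a string rather than a file list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; file names are invented sample data.
class ExtensionFilter {
    // Negative lookahead variant: the whole name must match.
    static final Pattern LOOKAHEAD = Pattern.compile("^(?!.*\\.(x[ms]l|cfg)$).+");
    // Negative lookbehind variant: a zero-width match at the very end of the name.
    static final Pattern LOOKBEHIND = Pattern.compile("(?<!\\.(?:xml|xsl|cfg))\\z");

    static List<String> keepWithLookahead(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (LOOKAHEAD.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    static List<String> keepWithLookbehind(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            // find() is needed here because the pattern consumes no characters.
            if (LOOKBEHIND.matcher(name).find()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.xml", "b.txt", "c.cfg", "d.xsl", "readme");
        System.out.println(keepWithLookahead(names));
        System.out.println(keepWithLookbehind(names));
    }
}
```

The lookbehind form works in Java because all three alternatives have the same fixed length. The two variants differ on an empty string: the lookahead form requires at least one character, while the lookbehind form matches.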

Matching the last group of something in Java

I am having trouble defining a regular expression (for a Java program) that gives me the last matching group of something. The reason is that I am converting some text files (here: the export of a wiki) to the format of a new wiki.
For example, when I have the following text:
Here another include: [[Include(a/a-1)]]
The hierarchy of the pages is:
/a
/a-1
The old wiki referenced the hierarchy name, the new wiki will only have the title of the page. The new format should look like:
{include:a-1}
Currently I have the following regular expression:
/\[\[Include\(([^\)]+)\)\]\]/
which matches from the example above a/a-1, but I need a regular expression that matches only a-1.
Is it possible to construct a regular expression for java that matches the last group only?
So for the following original lines:
[[Include(a)]]
[[Include(a/b)]]
[[Include(a/a-1)]]
[[Include(a/a-1/a-2)]]
I would like to match only
a
b
a-1
a-2
This is the regex you're looking for. Group 1 has the text you want, see the captures pane at the bottom right of the demo, as well as the Substitutions pane at the bottom.
EDIT: per your request, replaced the [a-z0-9-] with [^/] (Did not update the regex101 demo as this regex, which I confirmed to work, breaks in regex101, which uses / as a delimiter, even when escaping the /. However here is another demo on regexplanet)
Search:
\[\[Include\((?:[^/]+\/)*([^/]+)\)\]\]
Replace:
{include:$1}
How does it work?
After the opening bracket of the Include, we match a combination of characters such as a-1 (made of letters, dash and digits) followed by a forward slash, zero or more times, then we capture the last such combination of characters.
In the few languages that support infinite-width lookbehinds, we could match what you want without relying on Group 1 captures.
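A minimal Java sketch of the search-and-replace follows. One deliberate deviation, which is an assumption of this example and not part of the answer: [^/] also matches ) and line breaks, so when the whole file is processed in one call a match can run from one include into the next. The sketch therefore uses [^/)] instead. The class name IncludeConverter is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical converter; the pattern is the answer's, with [^/] narrowed to [^/)].
class IncludeConverter {
    // Skips any number of "segment/" parts, then captures the last segment in group 1.
    static final Pattern INCLUDE = Pattern.compile(
        "\\[\\[Include\\((?:[^/)]+/)*([^/)]+)\\)\\]\\]");

    static String convert(String text) {
        // $1 in the replacement refers to the captured last path segment.
        return INCLUDE.matcher(text).replaceAll("{include:$1}");
    }

    public static void main(String[] args) {
        String wiki = "[[Include(a)]]\n[[Include(a/b)]]\n[[Include(a/a-1)]]\n[[Include(a/a-1/a-2)]]";
        System.out.println(convert(wiki));
    }
}
```

If each line is converted separately and holds at most one include, the original [^/] version behaves the same way on the sample lines from the question.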

Regular Expression - Return all matches as a single match

I'm working with a piece of code that applies a regex to a string and returns the first match. I don't have access to modify the code to return all matches, nor do I have the ability to implement alternative code.
I have the following example target string:
usera,userb,,userc,,userd,usere,userf,
This is a list of comma delimited usernames joined from multiple sources, some of which were blank resulting in two commas in some places. I'm trying to write a regex that will return all of the comma delimited usernames except for specific values.
For example, consider the following expression:
[^,]\w{1,},(?<!(userb|userc|userd),)
This results in three matches:
usera,
usere,
userf,
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,' ?
If I could write code in any language this would be trivial, but I'm limited to input of only the target string and the pattern, and I need a single match that has all items except for the ones I'm omitting. I'm not sure if this is even possible; everything I've ever done with regexes involves processing multiple items in a match collection.
Here is an example in Regex Coach. This image shows that there are the three matches I want, but my requirement is to have the text in a single match, not three separate matches.
EDIT1:
To clarify, this ticket is specifically intended to solve the use case using only regular expression syntax. Solving this problem in code is trivial, but solving it using only a regex was the requirement, given that the executing code is part of a 3rd party product that I didn't want to reverse engineer, wrap, or replace.
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,'?
No. Regex matches are consecutive.
A regular expression matches a (sub)string from start to finish. You cannot drop the middle part, this is not how regex engines work. But you can apply the expression again to find another matching substring (incremental search - that's what Regex Coach does). This would result in a match collection.
That being said, you could also just match everything you don't want to keep and remove it, e.g.
,(?=[\s,]+)|(userb|userc|userd)[\s,]*
http://rubular.com/r/LOKOg6IeBa
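For readers who can run a replace step, the removal approach might look like this in Java. This is only a sketch: the class name UserListCleaner is made up, and it assumes the caller is able to substitute matches with an empty string, which the third-party product in the question may not allow.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating "match what you don't want and remove it".
class UserListCleaner {
    // First alternative: a comma that is followed by another comma or whitespace.
    // Second alternative: an unwanted user name plus any trailing commas/whitespace.
    static final Pattern UNWANTED = Pattern.compile(
        ",(?=[\\s,]+)|(userb|userc|userd)[\\s,]*");

    static String clean(String list) {
        return UNWANTED.matcher(list).replaceAll("");
    }

    public static void main(String[] args) {
        // Target string from the question.
        System.out.println(clean("usera,userb,,userc,,userd,usere,userf,"));
    }
}
```

The result is a single string, which is as close as this approach gets to a single match containing only the wanted names.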

Optional substring in capturing group

I have a regex that correctly captures a slash followed by a number in a string. The capturing group portion of the regex looks like this:
\(\d)+\\??
(some digits after a slash up to, but not including, a question mark) and there is more to the regex before and after this capturing group. Now I want to also include in my capturing group an optional specific prefix (call it "abc_"):
The entire prefix (all four characters) must be there to be included in the captured group
If no prefix is present then the digit portion of the capturing group is still captured
If the prefix is partially there or some other prefix is there then the string does not match the regex.
Some examples:
abc_12345 is captured
12345 is captured
ab_12345 fails to match the regex
abc_ fails to match the regex
abcd_ fails to match the regex
How do I construct this?
If I understand you correctly, you want this:
((?:abc_)?\\d+)[?]?
The ?: syntax turns the group into a non-capturing group. I don't understand the part about the partial prefix. If you allow any content in front of the regex, you cannot reject a certain optional prefix. You need a clear separator in front of the pattern, such as whitespace, in order to reject a prefix.
Your regex does not even seem to work for the case you already described. It captures only one digit and not the full number. Also your escaping is inconsistent.
However, this should do what you intend to do:
((?:abc_)?\\d+)\\??
Your last requirement, that a different prefix should not be matched, can only be answered if you give us the preceding part of the regex. (If this capturing group is preceded by \w+, for instance, any prefix would match, but only the full and correct prefix would be captured.)
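The point about needing a fixed separator can be illustrated with a short Java sketch. It assumes, purely for illustration, that a literal slash sits directly in front of the capturing group, as the question's description suggests. The real surrounding regex is not shown in the question, and the class name PrefixCapture is hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; the leading "/" stands in for the unknown preceding regex.
class PrefixCapture {
    // Group 1: optional full "abc_" prefix followed by one or more digits.
    static final Pattern ID = Pattern.compile("/((?:abc_)?\\d+)\\??");

    static Optional<String> capture(String input) {
        Matcher m = ID.matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(capture("/abc_12345?"));  // full prefix
        System.out.println(capture("/12345"));       // no prefix
        System.out.println(capture("/ab_12345"));    // partial prefix
    }
}
```

With the slash fixed in front, a partial or different prefix leaves nothing that can sit between the slash and the digits, so the match fails instead of silently dropping the prefix.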
