Tersest regex for a letter - java

Just wondering what the briefest way of specifying a letter in regex (in java) is, as measured by the number of characters in the regex. For example, \w doesn't work, because it includes numbers and the underscore.
Here are a couple of options:
\p{Alpha}
[a-zA-Z]
Is there anything shorter? It comes up so often that it would be good to know.

How about this:
\pL
It matches a Unicode letter.
The following expression matches only [a-zA-Z] and it has only eight characters, but the backslashes need to be escaped inside a Java string literal, which will increase the character count to 10.
[^\PL\W]
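To make the difference between the two expressions concrete, here is a minimal sketch. The class and method names are hypothetical, and it assumes default flags (in particular, UNICODE_CHARACTER_CLASS is not set, so \W is ASCII-based).

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class LetterClasses {
    // \pL: any Unicode letter.
    static final Pattern UNICODE_LETTER = Pattern.compile("\\pL");
    // [^\PL\W]: excludes non-letters and non-word characters,
    // which leaves only [a-zA-Z] under default flags.
    static final Pattern ASCII_LETTER = Pattern.compile("[^\\PL\\W]");

    static boolean isUnicodeLetter(String s) {
        return UNICODE_LETTER.matcher(s).matches();
    }

    static boolean isAsciiLetter(String s) {
        return ASCII_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "\u00e9" is e with an acute accent: a letter, but not an ASCII one.
        for (String s : new String[] {"a", "Z", "\u00e9", "5", "_"}) {
            System.out.println(s + " unicode=" + isUnicodeLetter(s)
                    + " ascii=" + isAsciiLetter(s));
        }
    }
}
```

The accented letter passes the first check but not the second, while the digit and the underscore fail both.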

This is the shortest I can think of:
(?i)[a-z]
However, this turns the case-insensitive switch on for the remainder of the regex, so if the rest of the regex needs to be case sensitive, you would need to turn the switch off again:
(?i)[a-z](?-i)
Which then makes it longer than [a-zA-Z].
So this answer is possibly the shortest, if it suits the situation.
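A small sketch of how far the flag reaches, in case it helps. The class name and sample inputs are made up for illustration. The last pattern uses the scoped form (?i:...), which java.util.regex also supports and which limits the flag to the group.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class CaseFlagScope {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The flag stays on, so the trailing x also matches X.
        System.out.println(matches("(?i)[a-z]x", "AX"));
        // The flag is switched off again before x.
        System.out.println(matches("(?i)[a-z](?-i)x", "AX"));
        System.out.println(matches("(?i)[a-z](?-i)x", "Ax"));
        // Scoped flag: only the group is case-insensitive.
        System.out.println(matches("(?i:[a-z])x", "Ax"));
    }
}
```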

Related

Pattern matching with special character "-" hyphen isn't working as expected? [duplicate]

How can I rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match a hyphen along with the existing characters?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
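The four cases above can be checked quickly. The question is about .NET, but this sketch uses Java to match the other threads here; these particular classes behave the same way in java.util.regex. The class name is hypothetical.

```java
// Hypothetical demo class; the name is illustrative only.
final class HyphenInClass {
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("[-]", "-"));     // lone hyphen is literal
        System.out.println(matches("[abc-]", "-"));  // trailing hyphen is literal
        System.out.println(matches("[-abc]", "-"));  // leading hyphen is literal
        System.out.println(matches("[ab-d]", "c"));  // b-d is a range
        System.out.println(matches("[ab-d]", "-"));  // so the hyphen itself is not in the class
    }
}
```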
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
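For what it's worth, the same property names work in Java, so here is a rough Java sketch of that class in action (the answer itself is about C♯). The class name and the sample strings are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class UnicodeAwareClass {
    // One or more letters, decimal digits, dashes, or the literals ! $ *
    static final Pattern ALLOWED = Pattern.compile("[\\p{L}\\p{Nd}\\p{Pd}!$*]+");

    static boolean isAllowed(String s) {
        return ALLOWED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // \u00e9 is an accented letter, \u2013 is EN DASH (category Pd).
        System.out.println(isAllowed("caf\u00e9\u20132024!"));
        // \u0663 is ARABIC-INDIC DIGIT THREE (category Nd).
        System.out.println(isAllowed("\u0663"));
        // Space and underscore are not in the class.
        System.out.println(isAllowed("a b"));
        System.out.println(isAllowed("a_b"));
    }
}
```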
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all equivalent. A hyphen placed between two ranges is treated as a literal symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
Use \p{Pd} to match any type of hyphen or dash. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");

Constructing a specific regex

I want to make a regular expression in Java that meets the following criteria:
Length: 10 characters exactly. Not more, not less.
Can accept any character between A-Z (only uppercase letters) and between digits 0-9.
Can accept only one dash character '-' in any position. It cannot accept any other characters, strictly only one dash.
EXAMPLES:
ABCD-12345
F-01234GHK
09-PL89GG5
LJ8U9N3-Y2
PLN86D4V-1
I have tried several regular expressions of my own. Some of them come close to the result I want, but none of them works.
Do I have to combine two regular expressions?
Please, help me to get rid of this issue.... and thanks in advance!!!
I think you need lookahead (which is a way of combining two regular expressions, sort of).
^(?!.*-.*-)[A-Z0-9-]{10}$
The second part will match 10 characters that are A-Z, 0-9, or dash; the first part is negative lookahead that will reject a pattern that has two dashes in it.
You can use this:
^(?![^-]*+-[^-]*+-)[A-Z0-9-]{10}$
Note: If you use the matches method you can remove anchors.
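Here is a minimal sketch of how this could be wired up in Java. One thing worth noting: the negative lookahead rejects two or more dashes, but it still accepts a string with no dash at all. If exactly one dash is required (as in all of the question's examples), a positive lookahead such as (?=[^-]*-[^-]*$) is one way to express that. The class and method names below are hypothetical, and both patterns rely on matches(), so they carry no anchors.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class TenCharCode {
    // Ten characters from A-Z, 0-9 or '-', with no second dash anywhere.
    static final Pattern AT_MOST_ONE_DASH =
            Pattern.compile("(?!.*-.*-)[A-Z0-9-]{10}");
    // Same, but the lookahead insists on exactly one dash.
    static final Pattern EXACTLY_ONE_DASH =
            Pattern.compile("(?=[^-]*-[^-]*$)[A-Z0-9-]{10}");

    static boolean hasAtMostOneDash(String s) {
        return AT_MOST_ONE_DASH.matcher(s).matches();
    }

    static boolean hasExactlyOneDash(String s) {
        return EXACTLY_ONE_DASH.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ABCD-12345", "ABCDE12345", "AB-CD-1234", "abcd-12345"}) {
            System.out.println(s + " atMostOne=" + hasAtMostOneDash(s)
                    + " exactlyOne=" + hasExactlyOneDash(s));
        }
    }
}
```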

Java regex mix two patterns

How can I get this pattern to work:
Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");
Basically, this will split my String[] sentence by any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:
(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])
pattern explained here: Java regex patterns
which are the hyphenated words like this: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?
Unfortunately it seems like you can't merge both expressions, at least as far as I know.
However, maybe you can reformulate your problem.
If, for example, you want to split between words (which can contain hyphens), try this expression:
(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)
This should match any sequence of non-letter characters that are not a minus, or any minus that is followed by a non-letter character or that is the first or last character in the input.
Using this expression for a split should result in this:
input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
The regex might need some additional optimization but it is a start.
Edit: This expression should get rid of the empty string in the split:
(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)
The first part would now read as "any non-letter character which is not a minus, followed by any number of non-letter characters" and should match .-- as well.
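A small sketch that runs both split expressions against the sample input, so the effect of the edit is visible. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class HyphenAwareSplit {
    // First version: can leave an empty string between two adjacent separators.
    static final Pattern FIRST =
            Pattern.compile("(?>[^\\p{L}-]+|-[^\\p{L}]+|^-|-$)");
    // Revised version: the first alternative also swallows following minuses.
    static final Pattern REVISED =
            Pattern.compile("(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)");

    static String[] splitFirst(String input) {
        return FIRST.split(input);
    }

    static String[] splitRevised(String input) {
        return REVISED.split(input);
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        System.out.println(Arrays.toString(splitFirst(input)));
        System.out.println(Arrays.toString(splitRevised(input)));
    }
}
```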
Edit: in case you want to match words that could potentially contain hyphens, try this expression:
(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)
This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letter ((?:-\p{L}+)*)". That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)), and be followed by anything that is not a letter or minus, or by the end of the input ((?>(?=[^-\p{L}])|$)).
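If you go the matching route instead of splitting, the expression could be used with Matcher.find() roughly like this. It is only a sketch; the class name, method name and sample input are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class HyphenatedWords {
    static final Pattern WORD = Pattern.compile(
            "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    // Collects every word, keeping inner hyphens as part of the word.
    static List<String> findWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("aaa-bb, aaa-bb-cc; dd"));
    }
}
```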

Regex (Java) to remove all characters up to but not including (a number or a letter a-f followed by a number)

I need help constructing the regular expression to remove all characters up to but not including (a number or a letter a-f followed by a number) in Java:
Here's what I came up with (doesn't work):
string.replaceFirst(".+?(\\d|[a-f]\\d)","");
That line of code replaces the entire string with an empty string.
My reasoning: .+? matches every character up to either \\d (a digit) or [a-f]\\d (one of the letters a-f followed by a digit).
This doesn't work, however. Can I have some help?
Thanks
EDIT: changed replace with replaceFirst
First off, replace() acts on literals, not regexes. You should use replaceFirst or replaceAll depending on what you want. Your regex problem is that you're including the suffix as part of the string to replace. You can give this a try:
input.replaceFirst(".+?(\\d|[a-f]\\d)","$1")
Here I just include the suffix in the replacement string as well. The more correct approach is to make that a zero-width assertion so that it doesn't get included in the region to replace. You can use a positive lookahead:
input.replaceFirst(".+?(?=(\\d|[a-f]\\d))", "")
The other answers given here have the problem that if the string starts with a-f followed by a number, or just a number, they will actually match and replace the first character. Not sure if that's a relevant scenario. This more convoluted pattern should work though:
"([^a-f\\d]|([a-f](?!\\d)))+"
(that is, everything that's not a digit or a-f, or a-f not followed by a digit).
I'd suggest something along the lines of
string.replaceFirst(".*?(?=(\\d|[a-f]\\d))", "");
s = s.replaceFirst(".*?(?=[a-f]?\\d)", "");
Using .*? instead of .+? ensures that the first character gets checked by the lookahead, solving the problem @johusman mentioned. And while your (\\d|[a-f]\\d) isn't causing a problem, [a-f]?\\d is both more efficient and more readable.
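To see the three variants side by side, here is a minimal sketch. The class name, method names and sample strings are invented for illustration.

```java
// Hypothetical demo class; the names are illustrative only.
final class StripPrefix {
    // Original attempt: the digit (or letter plus digit) is consumed too.
    static String consuming(String s) {
        return s.replaceFirst(".+?(\\d|[a-f]\\d)", "");
    }

    // Lookahead with .+?: must consume at least one character.
    static String lazyPlus(String s) {
        return s.replaceFirst(".+?(?=(\\d|[a-f]\\d))", "");
    }

    // Lookahead with .*?: may match nothing, so a leading a1 survives.
    static String lazyStar(String s) {
        return s.replaceFirst(".*?(?=[a-f]?\\d)", "");
    }

    public static void main(String[] args) {
        System.out.println(consuming("hello c3po"));
        System.out.println(lazyStar("hello c3po"));
        System.out.println(lazyPlus("a1bc"));
        System.out.println(lazyStar("a1bc"));
    }
}
```

With the input "a1bc", the .+? version drops the leading a, while the .*? version leaves the string untouched.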

Vowel ending detection via regex

With regular expressions, how can I test just the last characters of the given string for a match?
I want to check if something ends in any of the following:
vowel+consonant ("like 'ur' in devour")
vowel+'nt' ("paint")
vowel+'y' ("play")
and some others that are similar.
How can I do this with regular expressions (in Java)?
edit:
How would I use regular expressions to find out if a verb ends in the pattern
consonant-'e'
or various other combinations like
'ss' 'x' 'sh' 'ch' (sibilants)
in order to properly conjugate them in English as verbs.
I think this is the expression that you want. The first bit checks for the vowel and the second looks for any consonant or 'nt'
[aeiou]([^aeiou\W\d]|nt)$
I've checked it on http://regexpal.com/ which is my usual tester. The [^aeiou\W\d] means 'any that isn't a vowel, is alpha-numeric but isn't a number'. It could just be replaced by all the consonants, I suppose:
[aeiou]([bcdfghjklmnpqrstvwxyz]|nt)$
Note that this ignores any possibility of any characters other than those listed. It also tests lower case but I'm unsure how to do case insensitive regex in Java.
You need a regular expression like this:
^.*[aeiou]([^aeiou]|nt)$
which matches zero or more chars, followed by one of a, e, i, o and u, followed by either one char that isn't a vowel or exactly nt.
As was pointed out in the comments, the [^aeiou] doesn't perform as intended unless you assume only alpha chars are used
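One of the answers above was unsure about case-insensitive matching in Java; Pattern.CASE_INSENSITIVE is the standard flag for that. Below is a rough sketch using the first answer's expression with find(), so only the end of the word has to match. The class and method names are hypothetical, and note that [^aeiou\W\d] still lets an underscore through.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class VerbEndings {
    // A vowel followed by one non-vowel word character, or by "nt", at the end.
    static final Pattern VOWEL_THEN_CONSONANT = Pattern.compile(
            "[aeiou]([^aeiou\\W\\d]|nt)$", Pattern.CASE_INSENSITIVE);

    static boolean endsWithVowelConsonant(String word) {
        // find() looks for the pattern anywhere; the $ pins it to the end.
        return VOWEL_THEN_CONSONANT.matcher(word).find();
    }

    public static void main(String[] args) {
        for (String w : new String[] {"devour", "paint", "play", "DEVOUR", "tree"}) {
            System.out.println(w + " -> " + endsWithVowelConsonant(w));
        }
    }
}
```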
