How to build a validation regex with starting and ending character validations? - java

How do I build an explicit regex for a string that starts with a letter, has underscores or digits in the middle, and ends with a letter or digit?
The pattern I have tried so far, with test cases, can be seen here:
https://regex101.com/r/JedpJu/3
I want to filter out strings like the following:
_ (only underscore)
9a_d (string starting with numbers)
ad_ (ending with underscores)
EDIT
ad*d_rr (any special character other than underscore should also not be allowed)

You may use
^[A-Za-z](?:[A-Za-z0-9_]*[A-Za-z0-9])?$
which is the same as
^[A-Za-z](?:\w*[A-Za-z0-9])?$
In Java, you may use it with .matches() and omit the anchors:
s.matches("[A-Za-z](?:[A-Za-z0-9_]*[A-Za-z0-9])?")
s.matches("[A-Za-z](?:\\w*[A-Za-z0-9])?")
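If the string may include line breaks, note that adding (?s) as in
s.matches("(?s)[A-Za-z](?:[A-Za-z0-9_]*[A-Za-z0-9])?")
s.matches("(?s)[A-Za-z](?:\\w*[A-Za-z0-9])?")
makes no difference: (?s) only enables . to match line break chars, and these patterns contain no ., so a string with a line break is rejected either way.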
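As a quick check, here is a minimal sketch that runs the suggested pattern against the strings from the question. The class and method names (IdentifierCheck, isValid) are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

class IdentifierCheck {
    // A letter first, then optionally more letters/digits/underscores
    // that must end in a letter or digit.
    static final Pattern VALID =
            Pattern.compile("[A-Za-z](?:[A-Za-z0-9_]*[A-Za-z0-9])?");

    static boolean isValid(String s) {
        // matches() requires the whole input to match, so ^ and $ are not needed.
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"a", "ad_d9", "_", "9a_d", "ad_", "ad*d_rr"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which is what s.matches(...) does internally.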

Related

RegEx for combining multiple sequences

Like many people, I am struggling with what seems to be a "trivial" regex issue.
In a given text, whenever I encounter a word within {} brackets I need to extract it. At first I used
"\\{-?(\\w{3,})\\}"
and it worked OK,
as long as the word didn't have any whitespace or special character like '.
For example, {Project} returns Project. But {Project Test} or {Project D'arce} don't return anything.
I know that for whitespace characters I need to use \s. But it is not at all clear to me how to add it to the above. I tried:
"%\\{-?(\\w(\\s{3,})\\)\\}"))
but it is not working. Also, what if I want to allow words containing special characters like '? It's really frustrating.
How about matching any character inside {..} which is not }?
To do so you can use a negated character class [^..] like [^}]. So your regex can look like
"\\{[^}]{3,}\\}"
But if you want to limit your regex to some specific alphabet, you can also use a character class to combine many characters and even predefined shorthand character classes like \w, \s, \d and so on.
So if you want to accept any word character \w or whitespace \s or ' your regex can look like
"\\{[\\w\\s']{3,}\\}"
You could use a character class [\w\s'] and add to it whatever you want to allow:
\{-?([\w\s']{3,})}
In Java
String regex = "\\{-?([\\w\\s']{3,})}";
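To see how the two character-class approaches differ, here is a rough sketch that extracts the text between braces with Matcher.find(). The capture group in the negated-class pattern, the class name BraceExtractor and the sample text are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceExtractor {
    // Anything except '}' between the braces (group 1 added here to read the content back).
    static final Pattern ANY_BUT_CLOSING = Pattern.compile("\\{([^}]{3,})\\}");
    // Only word chars, whitespace and ' between the braces, with an optional leading hyphen.
    static final Pattern WORD_SPACE_QUOTE = Pattern.compile("\\{-?([\\w\\s']{3,})}");

    static List<String> extract(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1)); // content without the braces
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "{Project} {Project Test} {Project D'arce} {a-b-c} {ab}";
        System.out.println(extract(ANY_BUT_CLOSING, text));
        System.out.println(extract(WORD_SPACE_QUOTE, text));
    }
}
```

On this sample, the negated class also accepts {a-b-c}, while the explicit class skips it because - is not listed. Both skip {ab}, which is shorter than three characters.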
If you want to prevent a match that consists only of whitespace chars, you could use a repeating group:
\{-?\h*([\w']{3,}(?:\h+[\w']+)*)\h*}
About the pattern
\{ Match { char
-? Optional hyphen
\h* Match 0+ times a horizontal whitespace char
( Start of the capture group
[\w']{3,} Match 3 or more times either a word char or '
(?:\h+[\w']+)* Repeat 0+ times matching 1+ horizontal whitespace chars followed by 1+ chars listed in the character class
) End of the capture group
\h* Match 0+ times a horizontal whitespace char
} Match }
In Java
String regex = "\\{-?\\h*([\\w']{3,}(?:\\h+[\\w']+)*)\\h*}";
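A small sketch of the repeating-group pattern in use, assuming you want the trimmed content of each {...} as group 1. The class name BraceWords and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BraceWords {
    // \h (horizontal whitespace) is available in java.util.regex since Java 8.
    static final Pattern BRACED =
            Pattern.compile("\\{-?\\h*([\\w']{3,}(?:\\h+[\\w']+)*)\\h*}");

    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = BRACED.matcher(text);
        while (m.find()) {
            found.add(m.group(1)); // words only, surrounding spaces excluded
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(
                extract("{Project} { Project Test } {-Project D'arce} {   } {ab cd}"));
    }
}
```

With this input, {   } is skipped because it holds only whitespace, and {ab cd} is skipped because its first word has fewer than three characters.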

Regex to match lines not containing multiple whole words in any order

I need to exclude lines that contain two strings, using AND, but I am not able to come up with the proper regex. Please help.
In the example below, I want to exclude both 'badword' and 'test'.
Test string:
ghjghj badword test ghjghj
one two
abadwords
three
The regex I used is ^((?!badword).*.((?!test).))*$
To match lines that do not contain several whole words you may use either an alternation inside one negative lookahead anchored at the start or use several negative lookaheads each containing a single whole word matching pattern:
^(?!.*\b(?:badword|test)\b).*$
Or
^(?!.*\bbadword\b)(?!.*\btest\b).*$
You might need to use a multiline modifier (in Java, Pattern.MULTILINE or the inline (?m) flag) to make ^ and $ match the start/end of each line.
Details
^ - start of string/line
(?!.*\b(?:badword|test)\b) - a negative lookahead that fails the match if, to the right of the current location, there are 0+ chars other than line break chars followed by a word boundary, the substring badword or test, and then another word boundary
.* - then having any 0+ chars other than line break chars
$ - end of string/line.
This seems to do the trick:
^(?!.*\bbadword\b)(?!.*\btest\b).*$
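Here is a minimal sketch of the single-lookahead variant applied to the test string from the question, using Pattern.MULTILINE so that ^ and $ work per line. The class and method names (LineFilter, cleanLines) are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineFilter {
    // MULTILINE: ^ and $ match at the start/end of every line, not just the whole input.
    static final Pattern CLEAN_LINE =
            Pattern.compile("^(?!.*\\b(?:badword|test)\\b).*$", Pattern.MULTILINE);

    static List<String> cleanLines(String text) {
        List<String> kept = new ArrayList<>();
        Matcher m = CLEAN_LINE.matcher(text);
        while (m.find()) {
            kept.add(m.group()); // a whole line without either word
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "ghjghj badword test ghjghj\none two\nabadwords\nthree";
        System.out.println(cleanLines(text)); // [one two, abadwords, three]
    }
}
```

Note that a line is dropped as soon as it contains either whole word. abadwords is kept because badword is not a whole word there.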

Regex Match word that include a Dot

I have a question. Take this sentence, for example:
"HalloAnna daveca.nn dave anna ca. anna"
I only want to match the standalone "ca.".
My regex looks like this:
(?i)\b(ca\.)\b
But this doesn't work and I don't know why. Any ideas?
//Update
I execute it with:
testSource.replaceAll()
and with
pattern.matcher(testSource).replaceAll().
Neither works.
You must escape the dot and assert that a non-word character follows:
(?i)\bca\.(?=\W)
You should use it like this:
Pattern.compile("(?i)\\b(ca\\.)(?=\\W)").matcher(a).replaceAll("SOME TEXT");
With the Java string escapes removed, the regex is (?i)\b(ca\.)(?=\W).
Every \ in a regex has to be escaped in a Java string literal, as \\.
Also, a word boundary (\b) matches only at a position where a word character (letter, digit or underscore) is next to a non-word character or the start/end of the string. In your case "ca." ends with a dot, which is not a word character, and it is followed by a space, which is not one either, so there is no word boundary after the dot and you can't use \b at the end. You can use \W instead, which requires a non-word character after the dot. To keep that character out of the match (so it won't be replaced), wrap \W in a lookahead: (?=\W).
Another thing to watch for: an unescaped . matches any character. To match a literal dot you have to escape it as \., which in a Java string becomes \\..
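The following sketch runs the lookahead pattern on the sentence from the question. The second pattern, with (?!\w), is not from the answers above; it is an assumed variant that also accepts "ca." at the very end of the input, where (?=\W) has no character to look at. The class name StandaloneCa is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StandaloneCa {
    // From the answers: "ca." must be followed by a non-word character.
    static final Pattern CA_BEFORE_NON_WORD = Pattern.compile("(?i)\\bca\\.(?=\\W)");
    // Assumed variant: "ca." must not be followed by a word character (end of input is fine).
    static final Pattern CA_NOT_BEFORE_WORD = Pattern.compile("(?i)\\bca\\.(?!\\w)");

    static String replace(Pattern p, String source, String replacement) {
        // quoteReplacement keeps $ and \ in the replacement text literal.
        return p.matcher(source).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String source = "HalloAnna daveca.nn dave anna ca. anna";
        System.out.println(replace(CA_BEFORE_NON_WORD, source, "SOME TEXT"));
        System.out.println(replace(CA_NOT_BEFORE_WORD, "see ca.", "SOME TEXT"));
    }
}
```

In both patterns the "ca." inside daveca.nn is left alone, because there is no word boundary before the c.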

Regex that get rid of all the punctuations at the top and the end of a string

I am trying to come up with a regular expression that gets rid of all punctuation (if there is any) at both the start and the end of a string.
The regex I am using now looks like this (word is the string I want to convert):
word = word.replaceAll("['?:!.,;]*([a-z]+)['?:!.,;]*", "$1").toLowerCase();
However, I still get some weird cases. For example, 'Amen' goes to 'amen' and ''tis goes to 'tis. Can anyone help me modify it so that 'Amen' will go to amen and ''tis to tis. Thanks in advance!
Replace the following pattern:
^\p{P}+|\p{P}+$
With an empty string.
\p{P} means any punctuation character. The first part of the regex will remove punctuation at the start, and the second will do it at the end.
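In Java this replacement could look like the sketch below; the class and method names are made up for illustration. One thing to keep in mind: in java.util.regex, \p{P} is the Unicode punctuation category, so it covers characters like curly quotes but not symbols such as $ or +.

```java
class EdgePunctuation {
    // Removes runs of Unicode punctuation at the start and at the end of the word.
    static String strip(String word) {
        return word.replaceAll("^\\p{P}+|\\p{P}+$", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("'Amen'"));          // Amen
        System.out.println(strip("''tis"));           // tis
        System.out.println(strip("don't!"));          // don't
        System.out.println(strip("\u201CHi\u201D"));  // Hi (curly quotes removed)
        System.out.println(strip("$5!"));             // $5 ($ is a symbol, not punctuation)
    }
}
```

Punctuation inside the word, like the apostrophe in don't, is untouched because both alternatives are anchored.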
In Java you can use:
\\p{Punct}
to identify a punctuation character.
To remove punctuation characters from the start or end, use this:
word = word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
I couldn't reproduce the problem with ''tis becoming 'tis, but the problem with 'Amen' is that your regex doesn't accept upper-case characters, because [a-z] can accept only lower-case characters. You can fix that by adding A-Z to your character class or by making your regex case-insensitive with the (?i) flag.
So try maybe
replaceAll("['?:!.,;]*([a-zA-Z]+)['?:!.,;]*", "$1")
or
replaceAll("(?i)['?:!.,;]*([a-z]+)['?:!.,;]*", "$1")
You can also change your strategy to just removing punctuation at the start or at the end of the string. In that case you could just use
replaceAll("^\\p{Punct}+|\\p{Punct}+$","");
where
^ represents start of the string
$ represents end of the string
\\p{Punct} is a character class representing punctuation characters (one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~), but you can use your own ['?:!.,;] class if you want
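To compare the two strategies side by side, here is a small sketch. The method names (stripEdges, keepLetters) are hypothetical, and Locale.ROOT is an assumption added to keep toLowerCase() independent of the default locale.

```java
import java.util.Locale;

class WordNormalizer {
    // Strategy 1: only trim punctuation at the start and end of the string.
    static String stripEdges(String word) {
        return word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                   .toLowerCase(Locale.ROOT);
    }

    // Strategy 2: the original regex made case-insensitive with (?i).
    static String keepLetters(String word) {
        return word.replaceAll("(?i)['?:!.,;]*([a-z]+)['?:!.,;]*", "$1")
                   .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"'Amen'", "''tis", "don't!"}) {
            System.out.println(w + " -> " + stripEdges(w) + " / " + keepLetters(w));
        }
    }
}
```

The two differ on words with inner punctuation: for don't!, stripEdges gives don't, while keepLetters gives dont, because the unanchored pattern also matches the apostrophe in the middle of the word.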

Java regex mix two patterns

How can I get this pattern to work:
Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");
Basically, this will split my sentence into a String[] on any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:
(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])
pattern explained here: Java regex patterns
which matches hyphenated words like these: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?
Unfortunately it seems like you can't merge both expressions, at least as far as I know.
However, maybe you can reformulate your problem.
If, for example, you want to split between words (which can contain hyphens), try this expression:
(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)
This should match any sequence of non-letter characters that are not a minus, or any minus that is followed by a non-letter character, or a minus that is the first or last character in the input.
Using this expression for a split should result in this:
input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
The regex might need some additional optimization but it is a start.
Edit: This expression should get rid of the empty string in the split:
(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)
The first part would now read as "any non-letter which is not a minus, followed by any number of non-letter characters" and should match .-- as well.
Edit: in case you want to match words that could potentially contain hyphens, try this expression:
(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)
This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letter ((?:-\p{L}+)*). That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus, or by the end of the input ((?>(?=[^-\p{L}])|$))".
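Both ideas can be tried on the sample input with a sketch like this one, which uses the second split expression and the word-matching expression. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWords {
    // Delimiter: non-letter runs, or a minus that does not sit between two letters.
    static final Pattern DELIMITER =
            Pattern.compile("(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)");
    // Word: letters joined by single minus signs, not adjacent to another letter or minus.
    static final Pattern WORD = Pattern.compile(
            "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    static List<String> split(String input) {
        return Arrays.asList(DELIMITER.split(input));
    }

    static List<String> words(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        System.out.println(split(input)); // [aaa-bb, aaa-bb-cc, aaa-bb-c-dd, no, match, foo]
        System.out.println(words(input)); // [aaa-bb, aaa-bb-cc, aaa-bb-c-dd]
    }
}
```

The two approaches are not equivalent: splitting still yields no, match and foo, while the word-matching expression skips them because each one touches a minus that is not part of a valid hyphenated word.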
