regular expression validating string - java

I tried using this pattern
^[A-z]*[A-z,-, ]*[A-z]*
To match against a string that starts with multiple alpha characters (a-z) followed by multiple hyphens or spaces and ends with alpha characters, eg:
Azasdas- - sa-as
But it does not work.

Try ^[A-Za-z][A-Za-z -]*[A-Za-z]$
^ anchors the match at the start of the string, which must begin with a letter (A-Z or a-z), followed by any number of letters, spaces or hyphens. The string must then end with a letter, and $ anchors the match at the end of the string.
Also, you should not be using A-z because this will include unintended characters from ASCII range 91 to 96. See this table
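To see the A-z problem concretely, here is a minimal sketch (the class and method names are made up for illustration) that compares [A-z] with [A-Za-z] on the six characters that sit between 'Z' and 'a' in ASCII:

```java
import java.util.regex.Pattern;

class AzRangeDemo {
    // [A-z] spans code points 65..122, which also covers 91..96: [ \ ] ^ _ `
    static final Pattern LOOSE = Pattern.compile("[A-z]");
    // [A-Za-z] lists the two letter ranges separately.
    static final Pattern STRICT = Pattern.compile("[A-Za-z]");

    static boolean looseMatches(char c) {
        return LOOSE.matcher(String.valueOf(c)).matches();
    }

    static boolean strictMatches(char c) {
        return STRICT.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        // Print how each class treats the six characters between 'Z' and 'a'.
        for (char c = 91; c <= 96; c++) {
            System.out.println("'" + c + "' [A-z]=" + looseMatches(c)
                    + " [A-Za-z]=" + strictMatches(c));
        }
    }
}
```

An underscore or a caret would therefore slip through a validator built on [A-z].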

Don't use ',' (comma)
^[A-z]*[A-z- ]*[A-z]*

You don't want the commas. In a character class you also need to spell out the ranges as [A-Za-z\- ], because A-Z and a-z aren't contiguous in ASCII. You're missing some allowable spaces, and your last expression needs to account for the hyphen.
You need something closer to this:
^([A-Za-z]*)-\s*([A-Za-z][A-Za-z -]*)([A-Za-z-]*)$
Depending on how you actually want to break things up. Without knowing the context behind the "chunks", it may or may not just be easier to split it apart on hyphens.
Edit
Actually, it's more like:
^([A-Za-z]*)([- ]*)([A-Za-z-]*)$
This is a word, followed by arbitrary spaces and hyphens, followed by a word that may contain a hyphen.

The currently accepted answer (^[A-Za-z][A-Za-z-]*[A-Za-z]$) will only match strings that are at least two characters long--for example, it will match the string "AB", but not just "A" or "B". Compare that to this regex:
^[A-Za-z]+([ -]+[A-Za-z]+)*$
By grouping the [ -]+ and the second [A-Za-z]+ together I'm saying, if there are any spaces and/or hyphens, they must be followed by more letters. The * quantifier on the group makes it optional, so "A" will match, while still meeting the requirement that the string start and end with a letter.
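A minimal sketch of how this regex could be used from Java, assuming the goal is to validate a whole string (the class name and the sample inputs other than the one from the question are hypothetical):

```java
import java.util.regex.Pattern;

class NameValidator {
    // Letters, then zero or more groups of (spaces/hyphens followed by letters).
    static final Pattern NAME = Pattern.compile("^[A-Za-z]+([ -]+[A-Za-z]+)*$");

    static boolean isValid(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"A", "AB", "Azasdas- - sa-as", "abc-", "-abc", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

Because the separators and the following letters are grouped together, a trailing hyphen or space can never be the last thing matched.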

Related

How should this regular expression mentioned in the App Engine documentation be interpreted?

While reading through the App Engine documentation for Java, I came across this regular expression:
[0-9A-Za-z._-]{0,100}. I read the Wikipedia page for regular expressions but still could not properly decode this one.
The App Engine documentation mentions the following about valid strings for namespaces:
If you do not specify a value for namespace, the namespace is set to an empty string. The namespace string is arbitrary, but also limited to a maximum of 100 alphanumeric characters, periods, underscores, and hyphens. More explicitly, namespace strings must match the regular expression [0-9A-Za-z._-]{0,100}.
Can someone please help in breaking down the regular expression to help me understand how the pattern mentioned in the regular expression satisfies the prerequisites for a namespace mentioned above?
As always, thanks a lot for helping out!!
Teach a man how to fish
Everyone here will probably tell you to dump this expression into a tool such as regex101.
You will not only learn what your expression means, but also see how tweaking parts of it changes the result.
Another popular online tool is Debuggex, which draws a visualization of the expression (Debuggex Demo).
Square brackets indicate that any of the characters inside the brackets can be used. This is called a character class.
[abc] would match "a", "b" or "c" but not "d".
You can also specify a range within a character class to indicate that any of the characters in the range should match.
[a-e] means the same as [abcde]
In your regular expression, [0-9A-Za-z._-] matches an alphanumeric character, period, underscore or hyphen. The three ranges 0-9, A-Z and a-z cover the numerals, uppercase and lowercase letters respectively.
Curly brackets indicate how many times the preceding element can be matched.
a{3,5} means "the character 'a', repeated 3-5 times".
I.e. it matches "aaa" and "aaaaa" but not "aa" or "aaaaaa".
We can combine the curly braces with the character class to indicate we want to match any character in the character class multiple times.
[ab]{0,5} means "a mix of 'a' and 'b', between zero and five characters long"
I.e. it matches "aa", "bbb", "ababa" and "" but not "ababab" or "abc"
Combining these two concepts we can see how the regex matches the text description
[0-9A-Za-z._-]{0,100} means "a mix of 0-9, A-Z, a-z, ., _ and -, between zero and a hundred characters long"
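As a rough illustration, the namespace rule can be checked in Java like this. The class, method and sample strings are hypothetical and not part of the App Engine API. Note that matches() has to be used so the whole string is tested, since the expression itself has no anchors:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class NamespaceCheck {
    // Zero to one hundred characters, each a digit, letter, '.', '_' or '-'.
    static final Pattern NAMESPACE = Pattern.compile("[0-9A-Za-z._-]{0,100}");

    static boolean isValid(String namespace) {
        // matches() requires the entire input to fit the pattern.
        return NAMESPACE.matcher(namespace).matches();
    }

    // Helper that builds a string of n copies of c, used for the length checks.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(isValid(""));               // the empty namespace
        System.out.println(isValid("my-app_v1.2"));    // allowed characters only
        System.out.println(isValid("bad namespace"));  // contains a space
        System.out.println(isValid(repeat('a', 101))); // one character too long
    }
}
```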
Generally the square brackets mean "one of the contents"
0-9, A-Z, a-z, you could probably figure out what they mean. These are ranges that you can configure (so if you wanted you can do 3-7, etc.)
._- means "period, underscore, or hyphen"
So [0-9A-Za-z._-] should mean "one of either an alphanumeric character, period, underscore, or hyphen"
{0,100} just gives the number of times the preceding element (here, the character class) can appear: in this case, 0 to 100 times, inclusive.
Edit: Take a look at @zx81's answer too! His suggestion will be a lot more useful in the long run than my answer.

How to make a regular expression that matches tokens with delimiters and separators?

I want to be able to write a regular expression in java that will ensure the following pattern is matched.
<D-05-hello-87->
For the letter D, this can be either 'D' or 'E' in capital letters, and only one of these letters, once.
The two numbers you see must always be 2-digit decimal numbers, not 1 or 3 digits.
The string must start and end with '<' and '>' and contain '-' to separate the parts within.
The message in the middle, 'hello', can be any characters but must not be more than 99 characters in length. It can contain white space.
Also, this pattern will be repeated, so the expression needs to recognise the individual patterns within a long string of these patterns and ensure they each follow this structure.
So far I have tried this:
([<](D|E)[-]([0-9]{2})[-](.*)[-]([0-9]{2})[>]\z)+
But the problem is (.*), which greedily treats everything after it as part of the any-character match and ignores the rest of the pattern.
How might this be done? (Using Java reg ex syntax)
Try making the quantifier non-greedy, or use a negated character class:
(<([DE])-([0-9]{2})-(.*?)-([0-9]{2})>)
Live Demo: http://ideone.com/nOi9V3
Update: tested and working
<([DE])-(\d{2})-(.{1,99}?)-(\d{2})>
See it working: http://rubular.com/r/6Ozf0SR8Cd
There is no need to wrap -, < and > in [ ]
Assuming that you want to stop at the first dash, you could use [^-]* instead of .*. This will match all non-dash characters.
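For the repeated tokens, one possible approach is to loop with Matcher.find() instead of wrapping the expression in ( ... )+. The sketch below uses the non-greedy pattern from above. The class name and the sample input are made up, and the input is written in the form that pattern accepts (no dash directly before the closing >):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenScanner {
    // Group 1: D or E, group 2: two digits, group 3: message (1-99 chars, lazy),
    // group 4: two digits.
    static final Pattern TOKEN =
            Pattern.compile("<([DE])-(\\d{2})-(.{1,99}?)-(\\d{2})>");

    // Collects the message part (group 3) of every token found in the input.
    static List<String> messages(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(messages("<D-05-hello-87><E-11-foo bar-22>"));
    }
}
```

Note that find() silently skips text that does not fit the pattern, so this extracts tokens rather than validating the whole string.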

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter, and this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem in your expression is that the lookahead in ^(?!\s*$) only fails if there is nothing but whitespace up to the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think lookarounds are not needed for this task.
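A small sketch of the lookaround-free pattern ^\p{L}+(?: \p{L}+)*$ in Java (the class name and sample inputs are hypothetical). Remember that the backslashes have to be doubled inside a Java string literal:

```java
import java.util.regex.Pattern;

class LetterSpaceValidator {
    // One or more letters, then any number of (single space + letters) groups.
    static final Pattern WORDS = Pattern.compile("^\\p{L}+(?: \\p{L}+)*$");

    static boolean isValid(String s) {
        return WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"John Smith", " John", "John ", "John  Smith", "Jos\u00e9"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isValid(s));
        }
    }
}
```

As written, this version allows exactly one space between words. Changing the single space to " +" would allow runs of spaces.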
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
This includes space (32), horizontal tab (09), line feed (10), vertical tab (11), form feed (12) and carriage return (13).
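To check what \s accepts by default in Java (that is, without the UNICODE_CHARACTER_CLASS flag), a quick sketch along these lines can help. The class and method names are made up for illustration:

```java
class WhitespaceClassDemo {
    // True if the single character c is matched by \s with default flags.
    static boolean isRegexWhitespace(char c) {
        return String.valueOf(c).matches("\\s");
    }

    // The pattern from above: words separated by any kind of \s whitespace.
    static boolean wordsSeparatedByWhitespace(String s) {
        return s.matches("[a-zA-Z]+(\\s+[a-zA-Z]+)*");
    }

    public static void main(String[] args) {
        // Character codes 8 (backspace) through 13, plus 32 (space).
        int[] codes = {8, 9, 10, 11, 12, 13, 32};
        for (int code : codes) {
            System.out.println(code + " -> " + isRegexWhitespace((char) code));
        }
        // A tab between the words is accepted by the \s version.
        System.out.println(wordsSeparatedByWhitespace("a\tb"));
    }
}
```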
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requires the entire string to match.
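The ? in (?!\s*$) is part of the negative lookahead syntax, not an "optional" quantifier. The lookahead only rejects strings that consist entirely of whitespace, so a leading space still gets through.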
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
One letter, then optionally any mix of letters and whitespace that ends with a letter.
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word besides the first must have a space before it.
Hope I helped.

Java regex mix two patterns

How can I get this pattern to work:
Pattern pattern = Pattern.compile("[\\p{P}\\p{Z}]");
Basically, this will split my sentence (into a String[]) at any kind of punctuation character (\p{P}) or any kind of whitespace (\p{Z}). But I want to exclude the following case:
(?<![A-Za-z-])[A-Za-z]+(?:-[A-Za-z]+){1,}(?![A-Za-z-])
pattern explained here: Java regex patterns
which are the hyphenated words like this: "aaa-bb", "aaa-bb-cc", "aaa-bb-c-dd". So, how can I do that?
Unfortunately it seems like you can't merge both expressions, at least as far as I know.
However, maybe you can reformulate your problem.
If, for example, you want to split between words (which can contain hyphens), try this expression:
(?>[^\p{L}-]+|-[^\p{L}]+|^-|-$)
This should match any sequence of non-letter characters that are not a minus, or any minus that is followed by a non-letter character, or a minus that is the first or last character in the input.
Using this expression for a split should result in this:
input="aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo"
output={"aaa-bb","aaa-bb-cc","aaa-bb-c-dd","no","match","","foo"}
The regex might need some additional optimization but it is a start.
Edit: This expression should get rid of the empty string in the split:
(?>[^\p{L}-][^\p{L}]*|-[^\p{L}]+|^-|-$)
The first part would now read as "any non-letter character which is not a minus, followed by any number of non-letter characters" and should match .-- as well.
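Here is a sketch of how the revised delimiter expression might be used with Pattern.split in Java (the class and method names are hypothetical). The backslashes are doubled for the string literal:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class HyphenAwareSplit {
    // Delimiter: a non-letter that is not '-' plus any following non-letters,
    // or a '-' followed by non-letters, or a '-' at the very start or end.
    static final Pattern DELIMITER =
            Pattern.compile("(?>[^\\p{L}-][^\\p{L}]*|-[^\\p{L}]+|^-|-$)");

    static String[] words(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        String input = "aaa-bb, aaa-bb-cc, aaa-bb-c-dd,no--match,--foo";
        System.out.println(Arrays.toString(words(input)));
    }
}
```

With the sample input from above, the hyphenated words stay intact and the ",--" before "foo" is consumed as a single delimiter, so no empty string appears in the result.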
Edit: in case you want to match words that could potentially contain hyphens, try this expression:
(?>(?<=[^-\p{L}])|^)\p{L}+(?:-\p{L}+)*(?>(?=[^-\p{L}])|$)
This means "any sequence of letters (\p{L}+) followed by any number of sequences consisting of one minus and at least one more letter ((?:-\p{L}+)*). That sequence must be preceded by either the start of the input or anything that is not a letter or minus ((?>(?<=[^-\p{L}])|^)) and be followed by anything that is not a letter or minus, or by the end of the input ((?>(?=[^-\p{L}])|$))".
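If you go the matching route instead of splitting, the last expression could be applied with Matcher.find(), roughly as sketched here. The class name and sample input are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWordFinder {
    // Letters with optional single-hyphen joins, bounded on both sides by
    // something that is neither a letter nor a minus (or by the input edges).
    static final Pattern WORD = Pattern.compile(
            "(?>(?<=[^-\\p{L}])|^)\\p{L}+(?:-\\p{L}+)*(?>(?=[^-\\p{L}])|$)");

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("aaa-bb, cc dd-e-ff"));
    }
}
```

Unlike the split version, this rejects both halves of something like "no--match", because each half touches a minus on one side.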

Regex matching "four characters, followed by an unknown No of digits."

An example of how the Strings may look:
TADE000177
TADE007,daFG
TADE0277 DFDFG
It's a little unclear what you want.
If you mean four capital letters from A to Z, followed by at least one digit in 0-9 you could try this:
"^[A-Z]{4}[0-9]+"
If instead of capital letters you want to allow any character except newline, change [A-Z] to . (a dot).
If you want to also allow zero digits change the + to a *.
Exactly four characters followed by 1 or more digits: [A-Z]{4}\d+
Remember to escape the backslash if you put it in a string literal.
Breakdown:
[A-Z]…: An upper case letter, equivalent to \p{Upper}
To also include lower case letters, you could instead use [A-Za-z] or \p{Alpha}
…{4}… exactly 4 times
…\d… a digit
…+ 1 or more times
To allow 0 digits, you could change to *.
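A minimal Java sketch of the breakdown above, with the backslash escaped in the string literal. The class and method names are hypothetical. Since two of the sample strings carry extra text after the digits, it shows both a whole-string check (matches) and a prefix check (lookingAt):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CodePrefix {
    // Four upper-case letters followed by one or more digits.
    static final Pattern CODE = Pattern.compile("[A-Z]{4}\\d+");

    // True only if the entire string is four letters plus digits.
    static boolean isWholeCode(String s) {
        return CODE.matcher(s).matches();
    }

    // Returns the code at the start of the string, or null if there is none.
    static String leadingCode(String s) {
        Matcher m = CODE.matcher(s);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {"TADE000177", "TADE007,daFG", "TADE0277 DFDFG"};
        for (String s : samples) {
            System.out.println(s + " -> whole=" + isWholeCode(s)
                    + ", prefix=" + leadingCode(s));
        }
    }
}
```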
If I understood correctly what you're asking for, you can try: .{4}\d*
^\w{4}.*$
Matches a string starting with 4 word characters (letters, digits or underscores) followed by any number of any other characters.
Your examples include spaces and punctuation. If you know exactly which characters are allowed, then you might want to use this pattern instead:
^\w{4}[A-Za-z\d<other known characters go here>]*$
Remember to remove the < and > too :)
