Multiple underscores at the end using REGEX? [duplicate] - java

http://regexr.com/3ars8
^(?=.*[0-9])(?=.*[A-z])[0-9A-z-]{17}$
Should match "17 alphanumeric chars, hyphens allowed too, must include at least one letter and at least one number"
It'll correctly match:
ABCDF31U100027743
and correctly decline to match:
AB$DF31U100027743
(and almost any other non-alphanumeric char)
but will apparently allow:
AB^DF31U100027743

Because your character class [A-z] matches this symbol.
[A-z] matches [, \, ], ^, _, `, and the English letters.
Actually, it is a common mistake. You should use [a-zA-Z] instead to only allow English letters.
Here is a visualization from Expresso, showing what the range [A-z] actually covers:
So, this regex (with i option) won't capture your string.
^(?=.*[0-9])(?=.*[a-z])[0-9a-z-]{17}$
In my opinion, it is always safer to use Ignorecase option to avoid such an issue and shorten the regex.
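The difference between the two character classes can be checked with a small Java sketch. The class and method names (AzRangeDemo, loose, strict) are made up for illustration; the patterns and sample strings come from the question.

```java
import java.util.regex.Pattern;

public class AzRangeDemo {
    // Pattern from the question: [A-z] also covers [ \ ] ^ _ and the backtick.
    static final Pattern LOOSE =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-z])[0-9A-z-]{17}$");
    // Same pattern with explicit letter ranges.
    static final Pattern STRICT =
            Pattern.compile("^(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]{17}$");

    static boolean loose(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "ABCDF31U100027743", "AB$DF31U100027743", "AB^DF31U100027743"
        };
        for (String s : samples) {
            System.out.println(s + " loose=" + loose(s) + " strict=" + strict(s));
        }
    }
}
```

Only the third sample should differ between the two patterns, because ^ sits between Z and a in ASCII.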

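Regex character ranges follow ASCII order, and the printable ASCII characters run from space (32) to tilde (126).
The [A-z] token therefore matches everything from A (65) to z (122), including the six characters that sit between Z and a: [, \, ], ^, _ and `. Likewise, the [ -~] token matches every character from space to tilde.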
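As a rough check, the non-letter characters covered by [A-z] can be listed by looping over the printable ASCII range in Java. AzExtras and extras() are hypothetical names used only for this sketch.

```java
import java.util.regex.Pattern;

public class AzExtras {
    // Collects every printable ASCII character that [A-z] matches
    // but that is not actually a letter.
    static String extras() {
        Pattern loose = Pattern.compile("[A-z]");
        StringBuilder sb = new StringBuilder();
        for (char c = ' '; c <= '~'; c++) {
            boolean matched = loose.matcher(String.valueOf(c)).matches();
            if (matched && !Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("Non-letters matched by [A-z]: " + extras());
    }
}
```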

You're allowing A-z (capital 'A' through lower 'z'). You don't say what regex package you're using, but A-Z and a-z are not necessarily contiguous; in ASCII there are six other characters in between. Try this instead:
^(?=.*[0-9])(?=.*[A-Za-z])[0-9A-Za-z-]{17}$
It seems to meet your criteria for me in regexpal.

Related

What is the correct way to make a regex that accepts all letters and a selection of characters?

I need a Regex that accepts all letters(lowercase & uppercase), numbers and these characters/symbols ('-','_','#','.'). It is not required to be in the form of an Email address. The characters can be positioned anywhere in the word. It also should not accept spaces and the word length must be 8 or more.
This is what I have so far.
^(?=\S{8})[a-zA-Z]\w*(?:\.\w+)*(?:#\w+\.\w{2,4})?$
You may use the following regex:
^[a-zA-Z0-9_#.-]{8,}$
Details
^ - start of string
[a-zA-Z0-9_#.-]{8,} - 8 or more ASCII letters, digits, ., _, # or -
$ - end of string.
See the regex demo.
Watch out for \w in Android, it matches all Unicode letters and digits by default (but not in Java).
In Android/Java, when using it with .matches(), you may remove the ^ and $ anchors as this method requires a full string match:
String regex = "[a-zA-Z0-9_#.-]{8,}";
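A minimal Java sketch of the anchoring point, assuming the unanchored pattern above. The class and method names are illustrative only: matches() requires the whole string to match, while find() accepts any matching substring.

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    // Unanchored pattern, as in the String shown above.
    static final Pattern P = Pattern.compile("[a-zA-Z0-9_#.-]{8,}");

    // True only if the entire input matches (implicit anchoring).
    static boolean full(String s) {
        return P.matcher(s).matches();
    }

    // True if any substring of the input matches.
    static boolean partial(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"user_name#1", "user name#12345", "short"};
        for (String s : samples) {
            System.out.println("'" + s + "' matches=" + full(s) + " find=" + partial(s));
        }
    }
}
```

With find(), the ^ and $ anchors are still needed to reject input that contains a space.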
[a-zA-Z._#\d-]{8,}
should do the trick. With additional boundaries that'd be ^[a-zA-Z._#\d-]{8,}$.
^ Beginning of the line
[a-zA-Z._#\d-] Character class with a-z, A-Z and ., -, _, # as per your question, as well as digits (\d)
{8,} 8 or more times
$ End of the line
You can try it out on regex101.com here.
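An even shorter solution would be [\w.#-]{8,}, using \w as a shortcut for [a-zA-Z0-9_]. That is not correct, though: in some flavors \w matches more than ASCII letters and digits. Thanks to @Wiktor Stribiżew for the correction; see comments for more.
Here's another form, using the POSIX bracket expression for alphanumeric characters (java.util.regex does not support this bracket syntax and uses \p{Alnum} instead):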
^[[:alnum:]._#\-]{8,}$

regular expression to validate 2 alphanumerics

I have the following pattern to validate a string. It has to validate 4 letters, 6 numbers, 6 letters and 2 alphanumerics, but with my current pattern I can't get a valid test.
Pattern.compile("[A-Za-z]{4}\\d{6}\\w{6}\\[A-ZÑa-zñ0-9\\- ]{2}");
I think my pattern is wrong, because I'm not sure about this part: [A-ZÑa-zñ0-9\\- ]{2}
Can you please help me?
You can use pattern:
^[a-zA-Z]{4}[0-9]{6}[a-zA-Z]{6}[a-zA-Z0-9]{2}$
Check it live here.
In your expression you are using \w, which does not only match digits and alphabetic characters, but also underscores _.
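The point about \w and the underscore can be checked with a short Java sketch. The class name, method names and sample strings are invented for illustration; note that in Java source the regex \d is written "\\d" inside a string literal.

```java
import java.util.regex.Pattern;

public class CodeFormat {
    // 4 letters, 6 digits, 6 letters, 2 alphanumerics (ASCII only).
    static final Pattern LETTERS_ONLY =
            Pattern.compile("[A-Za-z]{4}\\d{6}[A-Za-z]{6}[A-Za-z0-9]{2}");
    // Same shape, but \w for the third group: also allows digits and '_'.
    static final Pattern WITH_W =
            Pattern.compile("[A-Za-z]{4}\\d{6}\\w{6}[A-Za-z0-9]{2}");

    static boolean strict(String s) {
        return LETTERS_ONLY.matcher(s).matches();
    }

    static boolean withW(String s) {
        return WITH_W.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"ABCD123456EFGHIJ7K", "ABCD123456EFGH_J7K"};
        for (String s : samples) {
            System.out.println(s + " strict=" + strict(s) + " withW=" + withW(s));
        }
    }
}
```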
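A few things are off in your regex.
The doubled backslashes in \\d and \\w are fine: inside a Java string literal, \\d is how you write the regex \d.
The \\ that is not needed is the one in front of [A-ZÑa-zñ0-9\\- ]. It escapes the opening bracket, so the regex looks for a literal [ instead of starting a character class.
Also remove the "\\- " bit, unless you really want to allow hyphens and spaces in the last two characters.
Don't slim the letter parts down to \w instead of [A-Za-z], because \w also matches digits and underscores. So, your new regex, written as a Java string literal, should look like:
"[A-Za-z]{4}\\d{6}[A-Za-z]{6}[A-ZÑa-zñ0-9]{2}"
That is if you're okay with the only non-ASCII characters being Ñ and ñ in your last two alphanumerics.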

What is the equivalent in .Net of the :print: character class from PHP or Java? [duplicate]

Is there a special regex statement like \w that denotes all printable characters? I'd like to validate that a string only contains characters that can be printed, i.e. does not contain ASCII control characters like \a (bell), or null, etc. Anything on the keyboard is fine, and so are UTF chars.
If there isn't a special statement, how can I specify this in a regex?
Very late to the party, but this regexp works: /[ -~]/.
How? It matches all characters in the range from space (ASCII DEC 32) to tilde (ASCII DEC 126), which is the range of all printable ASCII characters.
If you want to strip everything outside that range (control characters and non-ASCII characters), you could use something like:
$someString.replace(/[^ -~]/g, '');
NOTE: this is not valid .net code, but an example of regexp usage for those who stumble upon this via search engines later.
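For readers working in Java, the same range can be used roughly as follows. This is a sketch with made-up helper names (isPrintableAscii, strip), not code from the answer.

```java
public class PrintableAscii {
    // True if every character lies between space (32) and tilde (126).
    static boolean isPrintableAscii(String s) {
        return s.matches("[ -~]*");
    }

    // Removes every character outside the space-to-tilde range.
    static String strip(String s) {
        return s.replaceAll("[^ -~]", "");
    }

    public static void main(String[] args) {
        // (char) 4 is a control character, (char) 0xE9 is a non-ASCII letter.
        String dirty = "Hello" + (char) 4 + ", World" + (char) 0xE9 + "!";
        System.out.println(isPrintableAscii(dirty));
        System.out.println(strip(dirty));
    }
}
```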
If your regex flavor supports Unicode properties, this is probably the best way:
\P{Cc}
That matches any character that's not a control character, whether it be ASCII -- [\x00-\x1F\x7F] -- or Latin1 -- [\x80-\x9F] (also known as the C1 control characters).
The problem with POSIX classes like [:print:] or \p{Print} is that they can match different things depending on the regex flavor and, possibly, the locale settings of the underlying platform. In Java, they're strictly ASCII-oriented. That means \p{Print} matches only the ASCII printing characters -- [\x20-\x7E] -- while \P{Cntrl} (note the capital 'P') matches everything that's not an ASCII control character -- [^\x00-\x1F\x7F]. That is, it matches any ASCII character that isn't a control character, or any non-ASCII character--including C1 control characters.
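The Java behavior described above can be probed with a few single-character checks. This sketch assumes the default flags (no UNICODE_CHARACTER_CLASS); the class and method names are hypothetical.

```java
public class PrintClasses {
    // Tests a single-character string against a one-token regex.
    static boolean one(String regex, char c) {
        return String.valueOf(c).matches(regex);
    }

    public static void main(String[] args) {
        char eAcute = (char) 0xE9; // non-ASCII letter
        char c1 = (char) 0x85;     // C1 control character
        char eot = (char) 4;       // ASCII control character

        System.out.println("\\p{Print} on 'a':    " + one("\\p{Print}", 'a'));
        System.out.println("\\p{Print} on U+00E9: " + one("\\p{Print}", eAcute));
        System.out.println("\\P{Cntrl} on U+00E9: " + one("\\P{Cntrl}", eAcute));
        System.out.println("\\P{Cntrl} on U+0085: " + one("\\P{Cntrl}", c1));
        System.out.println("\\P{Cc} on U+0085:    " + one("\\P{Cc}", c1));
        System.out.println("\\P{Cc} on U+0004:    " + one("\\P{Cc}", eot));
    }
}
```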
TLDR Answer
Use this Regex...
[^\p{Cc}\p{Cn}\p{Cs}]
Working Demo
In this demo, I use this regex to search the string "Hello, World!" with a weird character appended at the end: (char)4, the END OF TRANSMISSION control character.
using System;
using System.Text.RegularExpressions;

public class Test {
    public static void Main() {
        // '!' followed by one character that is not a control (Cc),
        // unassigned (Cn), or surrogate (Cs) character.
        var regex = new Regex(@"![^\p{Cc}\p{Cn}\p{Cs}]");
        // (char)4 is a control character, so the "!" followed by it should not match.
        var matches = regex.Matches("Hello, World!" + (char)4);
        Console.WriteLine("Results: " + matches.Count);
        foreach (Match match in matches) {
            Console.WriteLine("Result: " + match);
        }
    }
}
Full Working Demo at IDEOne.com
TLDR Explanation
\P{Cc} : Do not match control characters.
\P{Cn} : Do not match unassigned characters.
\P{Cs} : Do not match surrogate code points (lone halves of UTF-16 surrogate pairs, which are invalid in UTF-8).
Alternatives
\P{C} : Match only visible characters. Do not match any invisible characters.
\P{Cc} : Match only non-control characters. Do not match any control characters.
[^\p{Cc}\p{Cn}] : Match only non-control characters that have been assigned. Do not match any control or unassigned characters.
[^\p{Cc}\p{Cn}\p{Cs}] : Match only non-control characters that have been assigned and are not surrogates. Do not match any control, unassigned, or surrogate (UTF-8-invalid) characters.
[^\p{Cc}\p{Cn}\p{Cs}\p{Cf}] : Match only non-control, non-formatting characters that have been assigned and are not surrogates. Do not match any control, unassigned, formatting, or surrogate (UTF-8-invalid) characters.
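A hedged Java sketch of combining several categories: to exclude all of them at once, the categories go into one negated class. Putting several \P{...} tokens in a positive class unions them, which ends up accepting almost everything. The class and method names are assumptions made for this example.

```java
import java.util.regex.Pattern;

public class UnicodePrintable {
    // One negated class: rejects control, unassigned and surrogate characters.
    static final Pattern PRINTABLE =
            Pattern.compile("[^\\p{Cc}\\p{Cn}\\p{Cs}]*");
    // Pitfall: a union of negations. A control character is "not unassigned",
    // so it is accepted by the \P{Cn} part of the class.
    static final Pattern UNION_OF_NEGATIONS =
            Pattern.compile("[\\P{Cc}\\P{Cn}]");

    static boolean isPrintable(String s) {
        return PRINTABLE.matcher(s).matches();
    }

    static boolean unionAccepts(char c) {
        return UNION_OF_NEGATIONS.matcher(String.valueOf(c)).matches();
    }

    public static void main(String[] args) {
        String clean = "Hello, World!";
        String dirty = clean + (char) 4; // END OF TRANSMISSION appended
        System.out.println("clean: " + isPrintable(clean));
        System.out.println("dirty: " + isPrintable(dirty));
        System.out.println("union accepts (char)4: " + unionAccepts((char) 4));
    }
}
```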
Source and Explanation
Take a look at the Unicode Character Properties available that can be used to test within a regex. You should be able to use these regexes in Microsoft .NET, JavaScript (with the u flag), Python (with the third-party regex module; the built-in re module does not support \p), Java, PHP, Ruby, Perl, Golang, and even Adobe. Knowing Unicode character classes is very transferable knowledge, so I recommend using it!
All Matchable Unicode Character Sets
If you want to know any other character sets available, check out regular-expressions.info...
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
There is a POSIX character class designation [:print:] that should match printable characters, and [:cntrl:] for control characters. Note that these match codes throughout the ASCII table, so they might not be suitable for matching other encodings.
Failing that, the expression [\x00-\x1f] will match through the ASCII control characters, although again, these could be printable in other encodings.
In Java, the \p{Print} option specifies the printable character class.
It depends wildly on what regex package you are using. This is one of these situations about which some wag said that the great thing about standards is there are so many to choose from.
If you happen to be using C, the isprint(3) function/macro is your friend.
Adding on to @Alan-Moore, \P{Cc} is actually an example of a Negative Unicode Category or Unicode Block (ref: Character Classes in Regular Expressions). \P{name} matches any character that does not belong to a Unicode general category or named block. See the referred link for more examples of named blocks supported in .NET.

How should this regular expression mentioned in the App Engine documentation be interpreted?

While reading through the App Engine documentation for Java, I came across this regular expression:
[0-9A-Za-z._-]{0,100}. I read the Wikipedia page for regular expressions but still could not properly decode this one.
The App Engine documentation mentions the following about valid strings for namespaces:
If you do not specify a value for namespace, the namespace is set to an empty string. The namespace string is arbitrary, but also limited to a maximum of 100 alphanumeric characters, periods, underscores, and hyphens. More explicitly, namespace strings must match the regular expression [0-9A-Za-z._-]{0,100}.
Can someone please help in breaking down the regular expression to help me understand how the pattern mentioned in the regular expression satisfies the prerequisites for a namespace mentioned above?
As always, thanks a lot for helping out!!
Teach a man how to fish
Everyone here will probably tell you to dump this expression into a tool such as regex101.
You will not only learn what your expression means, but also see how tweaking parts of it changes the result.
Another popular online tool here is the Debuggex visualizations.
Debuggex Demo
Square brackets indicate that any of the characters inside the brackets can be used. This is called a character class.
[abc] would match "a", "b" or "c" but not "d".
You can also specify a range within a character class to indicate that any of the characters in the range should match.
[a-e] means the same as [abcde]
In your regular expression, [0-9A-Za-z._-] matches an alphanumeric character, period, underscore or hyphen. The three ranges 0-9, A-Z and a-z cover the numerals, uppercase and lowercase letters respectively.
Curly brackets indicate that the preceding character can be matched multiple times.
a{3,5} means "the character 'a', repeated 3-5 times".
I.e. it matches "aaa" and "aaaaa" but not "aa" or "aaaaaa".
We can combine the curly braces with the character class to indicate we want to match any character in the character class multiple times.
[ab]{0,5} means "a mix of 'a' and 'b', between zero and five characters long"
I.e. it matches "aa", "bbb", "ababa" and "" but not "ababab" or "abc"
Combining these two concepts we can see how the regex matches the text description
[0-9A-Za-z._-]{0,100} means "a mix of 0-9, A-Z, a-z, ., _ and -, between zero and a hundred characters long"
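The breakdown above can be tried out in Java with String.matches(), which requires the whole string to match. NamespaceCheck, isValidNamespace and repeat are hypothetical helpers written for this sketch; they are not part of the App Engine API.

```java
public class NamespaceCheck {
    // The regex quoted from the App Engine documentation.
    static final String NAMESPACE_REGEX = "[0-9A-Za-z._-]{0,100}";

    static boolean isValidNamespace(String ns) {
        return ns.matches(NAMESPACE_REGEX);
    }

    // Builds a string of n copies of c, to test the length limit.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(isValidNamespace(""));               // empty string is allowed by {0,100}
        System.out.println(isValidNamespace("my-app_v1.2"));
        System.out.println(isValidNamespace("bad name"));       // contains a space
        System.out.println(isValidNamespace(repeat('a', 101))); // over the length limit
    }
}
```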
Generally the square brackets mean "one of the contents"
0-9, A-Z, a-z, you could probably figure out what they mean. These are ranges that you can configure (so if you wanted you can do 3-7, etc.)
._- means "period, underscore, or hyphen"
So [0-9A-Za-z._-] should mean "one of either an alphanumeric character, period, underscore, or hyphen"
{0,100} just gives the number of times the preceding group (I think that might be the term?) can appear (so in this case, 0 to 100 times, inclusive (I think))
Edit: Take a look at @zx81's answer too! His suggestion will be a lot more useful in the long run than my answer.

regular expression validating string

I tried using this pattern
^[A-z]*[A-z,-, ]*[A-z]*
To match against a string that starts with multiple alpha characters (a-z) followed by multiple hyphens or spaces and ends with alpha characters, eg:
Azasdas- - sa-as
But it does not work.
Try ^[A-Za-z][A-Za-z -]*[A-Za-z]$
^ anchors the match at the start of the string, which must begin with a letter (A-Z or a-z), followed by any number of letters, spaces or hyphens. The string must then end with a letter, right before $.
Also, you should not be using A-z because this will include unintended characters from ASCII range 91 to 96. See this table
Don't use ',' (comma)
^[A-z]*[A-z- ]*[A-z]*
You don't want the commas. In the character class you also need to specify [A-Za-z\- ], because A-Z and a-z aren't contiguous in ASCII. You're missing some allowable spaces, and your last expression needs to account for the hyphen.
You need something closer to this:
^([A-Za-z]*)-\s*([A-Za-z][A-Za-z -]*)([A-Za-z-]*)$
Depending on how you actually want to break things up. Without knowing the context behind the "chunks", it may or may not just be easier to split it apart on hyphens.
Edit
Actually, it's more like:
^([A-Za-z]*)([- ]*)([A-Za-z-]*)$
This is a word, followed by arbitrary spaces and hyphens, followed by a word that may contain a hyphen.
The currently accepted answer (^[A-Za-z][A-Za-z-]*[A-Za-z]$) will only match strings that are at least two characters long--for example, it will match the string "AB", but not just "A" or "B". Compare that to this regex:
^[A-Za-z]+([ -]+[A-Za-z]+)*$
By grouping the [ -]+ and the second [A-Za-z]+ together I'm saying, if there are any spaces and/or hyphens, they must be followed by more letters. The * quantifier on the group makes it optional, so "A" will match, while still meeting the requirement that the string start and end with a letter.
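A quick Java comparison of the two patterns, as a sketch. The class and method names are invented here; the regexes are the ones quoted in this answer.

```java
import java.util.regex.Pattern;

public class WordsWithSeparators {
    // Accepted answer as quoted above: needs at least two characters.
    static final Pattern ACCEPTED =
            Pattern.compile("^[A-Za-z][A-Za-z-]*[A-Za-z]$");
    // Grouped version: separators must be followed by more letters.
    static final Pattern GROUPED =
            Pattern.compile("^[A-Za-z]+([ -]+[A-Za-z]+)*$");

    static boolean accepted(String s) {
        return ACCEPTED.matcher(s).matches();
    }

    static boolean grouped(String s) {
        return GROUPED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"A", "AB", "Azasdas- - sa-as", "abc-", "-abc"};
        for (String s : samples) {
            System.out.println("'" + s + "' accepted=" + accepted(s) + " grouped=" + grouped(s));
        }
    }
}
```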
