I've only dabbled in regular expressions and was wondering if someone could help me make a Java regex, which matches a string with these qualities:
It is 1-14 characters long
It consists only of A-Z, a-z and the characters _ or -
The symbols - and _ may appear at most once in total (counted together) and not at the start
It should match
Hello-Again
ThisIsValid
AlsoThis_
but not
-notvalid
Not-Allowed-This
Nor-This_thing
VeryVeryLongStringIndeed
I've tried the following regex string
[a-zA-Z^\\-_]+[\\-_]?[a-zA-Z^\\-_]*
and it seems to work. However, I'm not sure how to do the total character limiting part with this approach. I've also tried
[[a-zA-Z]+[\\-_]?[a-zA-Z]*]{1,14}
but it matches (for example) abc-cde_aa which it shouldn't.
This ought to work:
(?![_-])(?!(?:.*[_-]){2,})[A-Za-z_-]{1,14}
The regex is quite complex, so let me try to explain it.
(?![_-]) is a negative lookahead. At the start of the string it asserts that the first character is not _ or -. A negative lookahead "peeks" ahead from the current position and checks that what follows does not match [_-], which is a character group containing _ and -.
(?!(?:.*[_-]){2,}) is another negative lookahead, this time on (?:.*[_-]){2,}, which is a non-capturing group repeated at least two times. The group is .*[_-]: any number of characters followed by the same character group as before. So we don't want to see some characters followed by _ or - two or more times.
[A-Za-z_-]{1,14} is the simple bit. It matches a character from the group [A-Za-z_-] between 1 and 14 times.
The second part of the pattern is the most tricky, but is a very common trick. If you want to see a character A repeated at some point in the pattern at least X times you want to see the pattern .*A at least X times because you must have
zzzzAzzzzAzzzzA....
You don't care what else is there. So what you arrive at is (.*A){X,}. Now, you don't need to capture the group - this just slows down the engine. So we make the group non-capturing - (?:.*A){X,}.
What you have is that you only want to see the pattern once, so you want not to find the pattern repeated two or more times. Hence it slots into a negative lookahead.
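The counting trick can also be tried in isolation. The following is only a minimal sketch, not part of the answer above: the class and method names (RepeatedCharCheck, hasAtLeast) are made up for illustration, and lookingAt() stands in for a lookahead evaluated at the start of the string.

```java
import java.util.regex.Pattern;

class RepeatedCharCheck {
    // Hypothetical helper: true if charClass (a regex character class such as "[_-]")
    // occurs at least `times` times in input, using the (?:.*A){X,} trick.
    static boolean hasAtLeast(String input, String charClass, int times) {
        Pattern p = Pattern.compile("(?:.*" + charClass + "){" + times + ",}");
        // lookingAt() matches a prefix starting at index 0, much like a lookahead at the start
        return p.matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(hasAtLeast("Hello-Again", "[_-]", 1));    // true
        System.out.println(hasAtLeast("Hello-Again", "[_-]", 2));    // false
        System.out.println(hasAtLeast("Not-Allow-This", "[_-]", 2)); // true
        System.out.println(hasAtLeast("Nor-This_thing", "[_-]", 2)); // true
    }
}
```

Negating the "at least 2" check is what the negative lookahead in the full pattern does.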
Here is a testcase:
public static void main(String[] args) {
final String pattern = "(?![_-])(?!(?:.*[_-]){2,})[A-Za-z_-]{1,14}";
final String[] tests = {
"Hello-Again",
"ThisIsValid",
"AlsoThis_",
"_NotThis_",
"-notvalid",
"Not-Allow-This",
"Nor-This_thing",
"VeryVeryLongStringIndeed",
};
for (final String test : tests) {
System.out.println(test.matches(pattern));
}
}
Output:
true
true
true
false
false
false
false
false
Things to note:
the character - is special inside character groups. It must go at the start or end of a group (or be escaped as \-), otherwise it specifies a range
lookaround is tricky and often counter-intuitive. It checks for a match without consuming any input, which lets you test multiple conditions against the same text.
the repetition quantifier {} is very useful. It has three forms: {X} means repeated exactly X times, {X,} means repeated at least X times, and {X,Y} means repeated between X and Y times.
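As a quick illustration of the notes on - and {}, here is a small sketch. The class name QuantifierDemo and its helper are invented for this example; the patterns are deliberately tiny so each result is easy to check by hand.

```java
class QuantifierDemo {
    // Thin wrapper so that each check reads as "does the whole input match this regex?"
    static boolean full(String input, String regex) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // The three quantifier forms
        System.out.println(full("aa", "a{2}"));       // true:  exactly 2
        System.out.println(full("aaa", "a{2}"));      // false
        System.out.println(full("aaaaa", "a{2,}"));   // true:  at least 2
        System.out.println(full("aaa", "a{2,4}"));    // true:  between 2 and 4
        System.out.println(full("aaaaa", "a{2,4}"));  // false

        // A hyphen in the middle of a group denotes a range; at the end (or escaped) it is literal
        System.out.println(full("b", "[a-c]"));       // true:  b lies in the range a..c
        System.out.println(full("-", "[a-c]"));       // false
        System.out.println(full("-", "[ac-]"));       // true:  literal hyphen
        System.out.println(full("-", "[a\\-c]"));     // true:  escaped hyphen
    }
}
```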
To check whether a string has the form XXX-XXX, where the -XXX or _XXX part is optional, you can use
[a-zA-Z]+([-_][a-zA-Z]*)?
which is similar to what you already had
[[a-zA-Z]+[\\-_]?[a-zA-Z]*]
but you made a crucial mistake and wrapped it entirely in [...], which makes it a character class, and that is not what you wanted.
To check that the matched part is 1-14 characters long you can use the look-ahead mechanism. Just place
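To see the difference between the bracketed version and the grouped one, here is a small sketch. The class and constant names are made up for illustration; the two patterns are the ones quoted in this thread.

```java
class CharClassPitfall {
    // The asker's attempt: the outer [...] turns the whole thing into one character class
    static final String BRACKETED = "[[a-zA-Z]+[\\-_]?[a-zA-Z]*]{1,14}";
    // The grouped form: letters, then at most one - or _ followed by optional letters
    static final String GROUPED = "[a-zA-Z]+([-_][a-zA-Z]*)?";

    static boolean bracketed(String s) { return s.matches(BRACKETED); }
    static boolean grouped(String s)   { return s.matches(GROUPED); }

    public static void main(String[] args) {
        System.out.println(bracketed("abc-cde_aa")); // true: every character is in the class
        System.out.println(grouped("abc-cde_aa"));   // false: a second separator is not allowed
        System.out.println(grouped("Hello-Again"));  // true
    }
}
```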
(?=.{1,14}$)
at the start of your regex to make sure that the part from the start of the match to its end (represented by $) consists of 1-14 characters of any kind.
So your final regex can look like
String regex = "(?=.{1,14}$)[a-zA-Z]+([-_][a-zA-Z]*)?";
Demo
String [] data = {
"Hello-Again",
"ThisIsValid",
"AlsoThis_",
"-notvalid",
"Not-Allowed-This",
"Nor-This_thing",
"VeryVeryLongStringIndeed",
};
for (String s : data)
System.out.println(s + " : " + s.matches(regex));
Output:
Hello-Again : true
ThisIsValid : true
AlsoThis_ : true
-notvalid : false
Not-Allowed-This : false
Nor-This_thing : false
VeryVeryLongStringIndeed : false
I'm trying to identify strings which contain exactly one integer.
That is exactly one string of contiguous digits e.g. "1234" (no dots, no commas).
So I thought this should do it: (This is with the Java String Escapes included):
(\\d+){1,}
So the "\d+" correctly matches a string of contiguous digits (right?).
I included this expression as a sub-expression within "(" and ")", and then I'm trying to say "only one of these sub-expressions".
Here's the result of ( matcher.find() ) of checking various strings:
(note the regex from now on is 'raw' here - NOT Java String Escaped).
Pattern:(\d+){1,}
Input String Result
1 true
XX-1234 true
do-not-match-no-integers false
do-not-match-1234-567 true
do-not-match-123-456 true
It seems the '1' in the pattern is applying to the "\d+" string, rather than to the number of those contiguous strings.
Because if I change the number from 1 to 4; I can see the result change to the following:
Pattern:(\d+){4,}
Input String Result
1 false
XX-1234 true
do-not-match-no-integers false
do-not-match-1234-567 true
do-not-match-123-456 false
What am I missing here ?
Out of interest - if I take off the "(" and ")" altogether - I'm getting a different result again
Pattern:\d+{4,}
Input String Result
1 true
XX-1234 true
do-not-match-no-integers false
do-not-match-1234-567 true
do-not-match-123-456 true
Matcher.find() will try to find a match anywhere inside the String. You should try Matcher.matches() instead to see whether the pattern matches the entire string.
In this way, the pattern you need is \d+
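The difference between the two calls can be checked with a short sketch. The class name FindVsMatches and its helper methods are invented for this illustration; Matcher.find() and Matcher.matches() are the standard java.util.regex methods.

```java
import java.util.regex.Pattern;

class FindVsMatches {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // true if some substring of s consists of digits
    static boolean found(String s) {
        return DIGITS.matcher(s).find();
    }

    // true only if the whole of s consists of digits
    static boolean whole(String s) {
        return DIGITS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(found("XX-1234")); // true
        System.out.println(whole("XX-1234")); // false
        System.out.println(found("1234"));    // true
        System.out.println(whole("1234"));    // true
    }
}
```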
EDIT:
Seems that I misunderstood the question. One way to find if the String has only one integer, using the same pattern is:
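// "input" is the string under test; needs java.util.regex.Matcher and java.util.regex.Pattern
Matcher matcher = Pattern.compile("\\d+").matcher(input);
int matchCounter = 0;
// stop as soon as a second integer has been found
while (matchCounter < 2 && matcher.find()) {
    matchCounter++;
}
return matchCounter == 1;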
This is the regex:
^[^\d]*\d+[^\d]*$
That's zero or more non-digits, followed by a substring of digits and then zero or more non-digits again until the end of the string. Here is the Java code (with escaped backslashes):
class MainClass {
public static void main(String[] args) {
String regex="^[^\\d]*\\d+[^\\d]*$";
System.out.println("1".matches(regex)); // true
System.out.println("XX-1234".matches(regex)); // true
System.out.println("XX-1234-YY".matches(regex)); // true
System.out.println("do-not-match-no-integers".matches(regex)); // false
System.out.println("do-not-match-1234-567".matches(regex)); // false
System.out.println("do-not-match-123-456".matches(regex)); // false
}
}
You can use the RegEx ^\D*?(\d+)\D*?$
^\D*? makes sure there are no digits between the start of the line and your first group
(\d+) matches your digits
\D*?$ makes sure there are no digits between your first group and the end of the line
Demo.
So, for your Java String, it would be : ^\\D*?(\\d+)\\D*?$
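Since this pattern has a capturing group, it can also hand back the integer it found. The following is a minimal sketch under that assumption; the class name SingleInteger and the method extract are hypothetical, not something from the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleInteger {
    private static final Pattern ONE_INT = Pattern.compile("^\\D*?(\\d+)\\D*?$");

    // Returns the digits if the input contains exactly one run of digits, otherwise empty
    static Optional<String> extract(String input) {
        Matcher m = ONE_INT.matcher(input);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("XX-1234"));               // Optional[1234]
        System.out.println(extract("XX-1234-YY"));            // Optional[1234]
        System.out.println(extract("do-not-match-1234-567")); // Optional.empty
    }
}
```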
I think you will have to make sure your regex considers the entire string, using ^ and $.
To do that, you could match zero or more non-digits, followed by 1 or more digits, and then zero or more non-digits.
The following should do the trick:
^[^\d]*(\d+)[^\d]*$
Here it is on regex101.com: https://regex101.com/r/CG0RiL/2
Edit: As pointed out by Veselin Davidov my regex isn't correct.
If I understand you correctly, you want it to return true only when the entire String matches the pattern, yes?
Then you have to call matcher.matches();
Also, I think your pattern should be just \d+.
If you have problems with regex I can recommend https://regex101.com/: it explains why something matches and gives you a quick preview.
I use it every time I have to write a regex.
The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit at a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit, it splits on each of them individually. You can match more than one digit at a time in a few different ways; here are a couple:
\d match a digit [0-9]
+? matches between one and unlimited times (lazy; in a split, where nothing follows the quantifier, it still stops after a single digit)
Or you could just do:
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
This is probably the closest to what you want, although the question leaves that unclear.
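To compare these quantifiers side by side, here is a small sketch. It assumes Java 8 or later (where a zero-width match at the start produces no leading empty string); the class name SplitQuantifiers is made up for the example.

```java
import java.util.Arrays;

class SplitQuantifiers {
    // Splits input on regex and renders the resulting array as text
    static String split(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(split("a12ij", "\\d?"));  // [a, , , i, j]
        System.out.println(split("a12ij", "\\d+?")); // [a, , ij]  (lazy: one digit per match)
        System.out.println(split("a12ij", "\\d+"));  // [a, ij]    (greedy: both digits at once)
    }
}
```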
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \digits as a group between one and unlimited times+) followed up by the greedy quantifier ?. I recommend visiting one of the popular regex sites that allow you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
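The sequence of matches described above can be printed directly with successive find() calls. This is just a sketch to make the positions visible; the class name SplitDelimiters and the "start-end" output format are choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitDelimiters {
    // Lists every match found by successive find() calls as "start-end" (end exclusive)
    static List<String> matchSpans(String input, String regex) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        // Empty matches show up as spans whose start equals their end
        System.out.println(matchSpans("a12ij", "\\d?"));
        // [0-0, 1-2, 2-3, 3-3, 4-4, 5-5]
    }
}
```

The substrings between consecutive spans are "a", "", "", "i", "j", which is the split() result once the empty trailing piece is dropped.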
I would like to create a matching pattern for a situation like this
DOMAIN+("Y|A")?
I would like the matching options to be only
DOMAIN
DOMAINY
DOMAINA
but seems like DOMAINX, DOMAINY etc. are matching as well.
Yes, they are matching because you did not specify that the String needed to end with this. DOMAIN(Y|A)? is matching DOMAINX because it rightfully contains DOMAIN followed by nothing (which is accepted since ? validates 0 or 1 occurrence).
You can add this restriction by specifying $ at the end of the regular expression.
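Sample code that shows the result of matches. In your full code, you probably want to compile a Pattern once instead of recompiling it on each call. Note that String.matches() already requires the whole string to match, so the $ only makes a difference with Matcher.find().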
public static void main(String[] args) {
String regex = "DOMAIN(Y|A)?$";
System.out.println("DOMAIN".matches(regex)); // prints true
System.out.println("DOMAINX".matches(regex)); // prints false
System.out.println("DOMAINY".matches(regex)); // prints true
System.out.println("DOMAINA".matches(regex)); // prints true
}
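For the precompiled variant, a minimal sketch could look like this. The class and method names are hypothetical; it uses find() so that the effect of the $ anchor is actually visible.

```java
import java.util.regex.Pattern;

class DomainCheck {
    // Compiled once and reused for every call
    private static final Pattern ANCHORED = Pattern.compile("DOMAIN(Y|A)?$");
    private static final Pattern UNANCHORED = Pattern.compile("DOMAIN(Y|A)?");

    static boolean findsAnchored(String s) {
        return ANCHORED.matcher(s).find();
    }

    static boolean findsUnanchored(String s) {
        return UNANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(findsUnanchored("DOMAINX")); // true: DOMAIN followed by "nothing"
        System.out.println(findsAnchored("DOMAINX"));   // false: X sits between DOMAIN and the end
        System.out.println(findsAnchored("DOMAINY"));   // true
        System.out.println(findsAnchored("MYDOMAIN"));  // true: find() does not anchor the start
    }
}
```

If a prefix such as MY should be rejected as well, add ^ at the front or call matches() instead of find().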
You could use word boundaries, \b, in order to prevent strings such as "DOMAINX" from being matched.
If you just want to handle cases where there are characters after the word, add \b to the end:
DOMAIN(?:Y|A)?\b
Otherwise, you could place \b around the expression to handle cases where there may be characters at the start/end:
\bDOMAIN(?:Y|A)?\b
I also made (?:Y|A) a non-capturing group and I removed the quotes.
See the matches here.
However, as your title implies, if you only want to handle characters at the end of a line, use the $ anchor at the end of your expression:
DOMAIN(?:Y|A)?$
You may have to add the m (multi-line) flag so that the anchor matches at the start/end of a line rather than at the start/end of the string:
(?m)DOMAIN(?:Y|A)?$
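A short sketch of how these variants behave with Matcher.find() follows. The class and method names are invented for this example and the sample inputs are made up; only the three patterns come from the answer.

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    private static final Pattern WORD = Pattern.compile("\\bDOMAIN(?:Y|A)?\\b");
    private static final Pattern LINE_END = Pattern.compile("(?m)DOMAIN(?:Y|A)?$");
    private static final Pattern INPUT_END = Pattern.compile("DOMAIN(?:Y|A)?$");

    static boolean asWord(String text)     { return WORD.matcher(text).find(); }
    static boolean atLineEnd(String text)  { return LINE_END.matcher(text).find(); }
    static boolean atInputEnd(String text) { return INPUT_END.matcher(text).find(); }

    public static void main(String[] args) {
        System.out.println(asWord("use DOMAINY here"));         // true
        System.out.println(asWord("DOMAINX"));                  // false: no boundary between N and X
        System.out.println(atLineEnd("DOMAINA\nnext line"));    // true: (?m) lets $ match before \n
        System.out.println(atInputEnd("DOMAINA\nnext line"));   // false: $ only matches at the very end
    }
}
```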
You need this
DOMAIN(Y|A)?
If you need it to be a word in text you should anchor it with \b as Josh shows.
Your regex does the following
DOMAIN+("Y|A")?
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
Match the character string “DOMAI” literally (case sensitive): DOMAI
Match the character “N” literally (case sensitive): N+
  Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
Match the regex below and capture its match into backreference number 1: ("Y|A")?
  Between zero and one times, as many times as possible, giving back as needed (greedy): ?
  Match this alternative (attempting the next alternative only if this one fails): "Y
    Match the character string “"Y” literally (case sensitive): "Y
  Or match this alternative (the entire group fails if this one fails to match): A"
    Match the character string “A"” literally (case sensitive): A"
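The breakdown can be confirmed by running the original pattern against a few inputs. This is only an illustrative sketch; the class name OriginalDomainPattern is made up, and the inputs were chosen to exercise the N+ repetition and the literal quote characters.

```java
class OriginalDomainPattern {
    // The pattern from the question, with the quotes escaped for a Java string literal
    static final String ORIGINAL = "DOMAIN+(\"Y|A\")?";

    static boolean matches(String s) {
        return s.matches(ORIGINAL);
    }

    public static void main(String[] args) {
        System.out.println(matches("DOMAIN"));     // true
        System.out.println(matches("DOMAINNN"));   // true:  N+ allows repeated N
        System.out.println(matches("DOMAIN\"Y"));  // true:  the quote is part of the alternative
        System.out.println(matches("DOMAINA\""));  // true
        System.out.println(matches("DOMAINY"));    // false: a bare Y is not one of the alternatives
    }
}
```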
How can I write this as a regular expression?
"blocka#123#456"
I have used the # symbol to separate the parameters in the data,
and the parameters are block name, start X coordinate, start Y coordinate.
This is the data embedded in my QR code, so when I scan the QR code I want to check that it's the right one being scanned. For that I need a regular expression for the above syntax.
my method body
public void Store_QR(String qr){
if( qr.matches(regular Expression here)) {
CurrentLocation = qr;
}
else // Break the operation
}
The information you specified does not justify using a regular expression at all.
Try to formulate it in a more general way.
If you really need to scan for "blocka#123#456" then use qr.contains("blocka#123#456");
It depends on what you want to match.
Here are some regex propositions:
^blocka#[0-9]{3}#[0-9]{3}$
^blocka#[0-9]+#[0-9]+$
^blocka(#[0-9]{3}){2}$
^blocka(#[0-9]+){2}$
^blocka(#[0-9]{3})+$
^blocka(#[0-9]+)+$
Otherwise, just use contains() or similar.
myregexp.com is nice to do some testing.
Official Java Regex Tutorial is quite ok to learn and includes most things one needs to know.
The Pattern documentation also includes fancy predefined character classes that are missing in above tutorial.
You did not specify anything that has to be regular in that example you gave. Regular expressions make only sense if there are rules to validate the input.
If it has to be exactly "blocka#123#456" then "blocka#123#456" or "^blocka#123#456$" will work as regex. Putting a regex between ^ and $ means that it must span from the beginning to the end of the input. That is sometimes required and usually a good idea.
If blocka is dynamic replace it with [a-z]+ to match any sequence of lowercase letters a through z with length of at least 1. block[a-z] would match blocka, blockb, etc.
And [a-z]{6} would match any sequence of exactly 6 letters. [a-zA-Z] also includes uppercase letters and \p{L} matches any letter including unicode stuff (e.g. Blüc本).
# matches #. Like any character without special regex meaning ( \ ^ $ . | ? * + ( ) [ ] { } ) characters match themselves. [^#] matches every character but #.
Regarding the numbers: [0-9]+ or \d+ is a generic pattern for a run of digits, [0-9]{1,4} would match anything consisting of 1-4 digits like 007, 5, 9999. (?:0|[1-9][0-9]{0,3}) for example will only match numbers between 0 and 9999 and does not allow leading zeros. (?:STUFF) is a non-capturing group that does not affect the groups you can extract via Matcher#group(1..?). Useful for logical grouping with |. The meaning of (?:0|[1-9][0-9]{0,3}) is: either a single 0 OR ( 1x 1-9 followed by 0 to 3 x 0-9).
[0-9] is so common that there is a predefinition for it : \d (digit). It's \\d inside the regex String since you have to escape the \.
So some of your options are
".*" which matches absolutely everything
"^[^#]+(?:#[^#]+)+$" which matches anything separated by # like "hello #world!1# -12.f #本#foo#bar"
"^blocka(#\\d+)+$" which matches blocka followed by at least one group of numbers separated by # e.g. blocka#1#12#0007#949432149#3
"^blocka#(?:[0-9]|[1-9][0-9]|[1-3][0-9]{2})#[4-9][0-9]{2}$" which will match only if it finds blocka# followed by numbers 0 - 399, followed by a # and finally numbers 400-999
"^blocka#123#456$" which matches only exactly that string.
All that are regular expressions that match the example you gave.
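To check the range-restricted option from the list, a small sketch follows. The class name QrRangeCheck and the sample inputs are made up for this illustration; the pattern itself is the one listed above.

```java
class QrRangeCheck {
    // blocka#<0..399>#<400..999>, no leading zeros in the first number
    static final String RANGE_REGEX =
            "^blocka#(?:[0-9]|[1-9][0-9]|[1-3][0-9]{2})#[4-9][0-9]{2}$";

    static boolean valid(String qr) {
        return qr.matches(RANGE_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(valid("blocka#123#456")); // true
        System.out.println(valid("blocka#399#400")); // true:  both at the edge of their range
        System.out.println(valid("blocka#400#400")); // false: first number above 399
        System.out.println(valid("blocka#007#500")); // false: leading zeros are rejected
    }
}
```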
But it's probably as simple as
public void Store_QR(String qr) {
    if (qr.matches("^blocka#\\d+#\\d+$")) {
        CurrentLocation = qr;
    } else {
        // Break the operation
    }
}
or
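// needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
// compiled once, so the regex is not re-parsed on every scan
private static final Pattern QR_PATTERN = Pattern.compile("^blocka#(\\d+)#(\\d+)$");
public void Store_QR(String qr) {
    Matcher matcher = QR_PATTERN.matcher(qr);
    if (matcher.matches()) {
        // group(1) and group(2) hold the two coordinates as text
        int number1 = Integer.parseInt(matcher.group(1));
        int number2 = Integer.parseInt(matcher.group(2));
        CurrentLocation = qr;
    } else {
        // Break the operation
    }
}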
BlockName#start_X#start_Y any block name.. starting with the string"block" and followed by two integers
I guess a good regex for that would be "^block\\w+#\\d+#\\d+$": starting with "block", then any combination of a-z, A-Z, 0-9 and _ (that's the \w) followed by #, numbers, #, numbers.
It would match block_#0#0, blockZ#9#9 and block_a_Unicorn666#0000#1234, but not block#1#2 because there is no name at all, and not blockName#123#abc because it has letters instead of a number. It would also not match Block_a#123#456 because of the uppercase B.
If the name part (\\w+) is too liberal (___ and _123 would be legal names), use e.g. "^block_?[a-zA-Z]+#\\d+#\\d+$", which won't allow numbers; the name may be preceded by a single optional _ and there have to be letters after that. It would allow _a, a and _ABc, but not _, _a_b or _a9. If you want to allow numbers in names, [a-zA-Z0-9] would be the character class to use.
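The cases listed above can be checked mechanically. Here is a minimal sketch; the class name BlockNameCheck and its two method names are invented for this example, and the inputs are the ones mentioned in the text with "block" prepended where needed.

```java
class BlockNameCheck {
    static final String LIBERAL = "^block\\w+#\\d+#\\d+$";
    static final String STRICT = "^block_?[a-zA-Z]+#\\d+#\\d+$";

    static boolean liberal(String qr) { return qr.matches(LIBERAL); }
    static boolean strict(String qr)  { return qr.matches(STRICT); }

    public static void main(String[] args) {
        System.out.println(liberal("block_a_Unicorn666#0000#1234")); // true
        System.out.println(liberal("block#1#2"));                    // false: no name at all
        System.out.println(liberal("Block_a#123#456"));              // false: uppercase B
        System.out.println(strict("block_ABc#1#2"));                 // true
        System.out.println(strict("block_a_b#1#2"));                 // false: second underscore
        System.out.println(strict("block_a9#1#2"));                  // false: digit in the name
    }
}
```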
I suggest:
[a-z]+#\d+#\d+
And if you want capture the 3 parts:
([a-z]+)#(\d+)#(\d+)
Matcher.group(1), group(2) and group(3) return the three parts
I don't know if this is possible using regex. I'm just asking in case someone knows the answer.
I have a string ="hellohowareyou??". I need to split it like this
[h, el, loh, owar, eyou?, ?].
The splitting is done such that the first string will have length 1, second length 2 and so on. The last string will have the remaining characters. I can do it easily without regex using a function like this.
public ArrayList<String> splitString(String s)
{
int cnt=0,i;
ArrayList<String> sList=new ArrayList<String>();
for(i=0;i+cnt<s.length();i=i+cnt)
{
cnt++;
sList.add(s.substring(i,i+cnt));
}
sList.add(s.substring(i,s.length()));
return sList;
}
I was just curious whether such a thing can be done using regex.
Solution
The following snippet generates the pattern that does the job (see it run on ideone.com):
// splits at indices that are triangular numbers
class TriangularSplitter {
// asserts that the prefix of the string matches pattern
static String assertPrefix(String pattern) {
return "(?<=(?=^pattern).*)".replace("pattern", pattern);
}
// asserts that the entirety of the string matches pattern
static String assertEntirety(String pattern) {
return "(?<=(?=^pattern$).*)".replace("pattern", pattern);
}
// repeats an assertion as many times as there are dots behind current position
static String forEachDotBehind(String assertion) {
return "(?<=^(?:.assertion)*?)".replace("assertion", assertion);
}
public static void main(String[] args) {
final String TRIANGULAR_SPLITTER =
"(?x) (?<=^.) | measure (?=(.*)) check"
.replace("measure", assertPrefix("(?: notGyet . +NBefore +1After)*"))
.replace("notGyet", assertPrefix("(?! \\1 \\G)"))
.replace("+NBefore", forEachDotBehind(assertPrefix("(\\1? .)")))
.replace("+1After", assertPrefix(".* \\G (\\2?+ .)"))
.replace("check", assertEntirety("\\1 \\G \\2 . \\3"))
;
String text = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
System.out.println(
java.util.Arrays.toString(text.split(TRIANGULAR_SPLITTER))
);
// [a, bc, def, ghij, klmno, pqrstu, vwxyzAB, CDEFGHIJ, KLMNOPQRS, TUVWXYZ]
}
}
Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G and forward references.
References
This is a brief description of the basic regex constructs used:
(?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespaces are ignored (and # can be used for comments).
^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
| denotes alternation (i.e. "or").
? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
(…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
(?=…) is a positive lookahead; it looks to the right to assert that there's a match of the given pattern.(?<=…) is a positive lookbehind; it looks to the left.
(?!…) is a negative lookahead; it looks to the right to assert that there isn't a match of a pattern.
Related questions
Articles in the [nested-reference] series:
How does this regex find triangular numbers?
How can we match a^n b^n with Java regex?
How does this Java regex detect palindromes?
How does the regular expression (?<=#)[^#]+(?=#) work?
Explanation
The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:
(?<=^.), i.e. we can lookbehind and see the beginning of the string one dot away
This matches at index 1, and is a crucial starting point to the rest of the process
Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.
Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. Java doesn't have custom-named groups, but here are the semantics for the 3 capturing groups:
\1 captures the string "before" \G
\2 captures some string "after" \G
If the length of \1 is e.g. 1+2+3+...+k, then the length of \2 needs to be k.
Hence \2 . has length k+1 and should be the next part in our split!
\3 captures the string to the right of our current position
Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G
You can use mathematical induction to rigorously prove the correctness of this algorithm.
To help illustrate how this works, let's work through an example. Let's take abcdefghijklmn as input, and say that we've already partially split off [a, bc, def].
           \G     we now need to match here!
           ↓       ↓
a b c d e f g h i j k l m n
\____1____/ \_2_/ . \__3__/   <--- \1 \G \2 . \3
  L=1+2+3    L=3
Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occurred at 1+2+3+...+k, then the next match needs to be k+1 positions after \G to be a triangular number index.
Thus in our example, given that \G is where we just split off def, we measured that k=3, and the next match will split off ghij as expected.
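The arithmetic behind these split positions is easy to reproduce without any regex. The following sketch is not the regex solution itself, just the index calculation it emulates; the class and method names are made up, and unlike the asker's loop it never appends an empty final piece.

```java
import java.util.ArrayList;
import java.util.List;

class TriangularSplit {
    // Split points 1, 3, 6, 10, ... (triangular numbers) that fall strictly inside the string
    static List<Integer> splitPoints(int length) {
        List<Integer> points = new ArrayList<>();
        int index = 0;
        for (int k = 1; index + k < length; k++) {
            index += k;          // the next point lies k positions after the previous one
            points.add(index);
        }
        return points;
    }

    // Cuts s at every split point; the last piece holds whatever remains
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int point : splitPoints(s.length())) {
            parts.add(s.substring(start, point));
            start = point;
        }
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitPoints(14));           // [1, 3, 6, 10]
        System.out.println(split("abcdefghijklmn"));   // [a, bc, def, ghij, klmn]
        System.out.println(split("hellohowareyou??")); // [h, el, loh, owar, eyou?, ?]
    }
}
```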
To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it's notGyet, we count up to k as follows:
+NBefore, i.e. we extend \1 by one forEachDotBehind
+1After, i.e. we extend \2 by just one
Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.
Conclusion
Needless to say, this particular solution has a terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren't so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it's done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).
There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not designed for this kind of pattern matching.
That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation has been instructive.
The purpose of regex is to recognize patterns. Here you aren't searching for a pattern but splitting by length, so regex is not appropriate.
It is probably possible, but not with a single regex: to find the first n characters using a regex, you use: "^(.{n}).*"
So, you can search with that regex for the 1st character.
Then, you take the substring that follows and search for the next 2 characters.
Etc.
Like @splash said, it will make the code more complicated and inefficient, since you would be using regex for something outside of its purpose.
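The step-by-step idea described above could be sketched as follows. This is one possible reading of it, not a recommended approach; the class name PrefixRegexSplit is made up, and DOTALL is an added assumption so that line breaks in the input count as ordinary characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixRegexSplit {
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        String rest = s;
        // Take 1 character, then 2, then 3, ... while more than n characters remain
        for (int n = 1; rest.length() > n; n++) {
            Matcher m = Pattern.compile("^(.{" + n + "}).*", Pattern.DOTALL).matcher(rest);
            if (!m.matches()) {
                break; // cannot happen while rest has at least n characters; kept as a guard
            }
            parts.add(m.group(1));
            rest = rest.substring(n);
        }
        if (!rest.isEmpty()) {
            parts.add(rest); // the remaining characters form the last piece
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("hellohowareyou??")); // [h, el, loh, owar, eyou?, ?]
    }
}
```

A new Pattern is compiled on every iteration, which is part of why this is slower than plain substring() calls.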
String a = "hellohowareyou??";
int i = 1;
while(true) {
if(i >= a.length()) {
System.out.println(a);
break;
}
else {
String b = a.substring(i++);
String[] out = a.split(Pattern.quote(b) + "$");
System.out.println(out[0]);
a = b;
if(b.isEmpty())
break;
}
}