Regex to remove all whitespace except around keywords and between quotes - java

I want to:
remove all whitespaces unless it's right before or after (0-1 space before and 0-1 after) the predefined keywords (for example: and, or, if then we leave the spaces in " and " or " and" or "and " unchanged)
ignore everything between quotes
I've tried many patterns. The closest I've come up with is pretty close, but it still removes the space after keywords, which I'm trying to avoid.
regex:
\s(?!and|or|if)(?=(?:[^"]*"[^"]*")*[^"]*$)
Test String:
if (ans(this) >= ans({1,2}) and (cond({3,4}) or ans(this) <= ans({5,6})), 7, 8) and {111} > {222} or ans(this) = "hello my friend and or " and(cond({1,2}) $1 123
Ideal result:
if (ans(this)>=ans({1,2}) and (cond({3,4}) or ans(this)<=ans({5,6})),7,8) and {111}>{222} or ans(this)="hello my friend and or " and(cond({1,2})$1123
I then can use str = str.replaceAll in java to remove those whitespaces. I don't mind doing multiple steps to get to the result, but I am not familiar with regex so kinda stuck.
any help would be appreciated!
Note: I edited the result. Sorry about that. For the space around keywords: shrunk to 1 if there are spaces. Either leave it or add 1 space if it's 0 (I just don't want "or ans" becomes "orans", but "and(cond" becomes "and (cond)" is fine (shrink to 1 space before and 1 space after if exists). Ignore everything between quotes.

You can make intelligent use of capturing groups here. The general idea is
match_this|or_this|or_even_this|(but_capture_this)
In terms of a regular expression this could be
(?:(?:\s+(?:and|or|if)\s+)|"[^"]+")|(\s+)
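You then replace a match only if the first capturing group is not empty, i.e. only when the (\s+) alternative matched; keyword and quoted matches are left as they are.
See a demo on regex101.com (it uses (*SKIP)(*FAIL), which serves the same purpose but is PCRE syntax and is not supported by java.util.regex).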
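A minimal Java sketch of this keep-or-drop idea, assuming Java 9+ (for Matcher.replaceAll with a function). The class and method names are made up for illustration; the keyword list is the one from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeepOrDrop {
    // Alternatives 1 and 2 match text that must survive; only alternative 3 captures.
    private static final Pattern P =
        Pattern.compile("\\s+(?:and|or|if)\\s+|\"[^\"]+\"|(\\s+)");

    static String squeeze(String input) {
        return P.matcher(input).replaceAll(r ->
            r.group(1) != null
                ? ""                                    // plain whitespace: drop it
                : Matcher.quoteReplacement(r.group())); // keyword or quoted text: keep as is
    }

    public static void main(String[] args) {
        System.out.println(squeeze("a >= b and c = \"x  y\" , 7"));
    }
}
```

Note that this variant leaves the whitespace around a keyword untouched rather than normalising it to a single space.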

You may use
String example = " if (ans(this) >= ans({1,2}) and (cond({3,4}) or ans(this) <= ans({5,6})), 7, 8) and {111} > {222} or ans(this) = \"hello my friend and or \" and(cond({1,2}) $1 123 ";
String rx = "\\s*\\b(and|or|if)\\b\\s*|(\"[^\"]*\")|(\\s+)";
Matcher m = Pattern.compile(rx).matcher(example);
example = m.replaceAll(r -> r.group(3) != null ? "" : r.group(2) != null ? r.group(2) : " " + r.group(1) + " ").trim();
System.out.println( example );
See the Java demo.
The pattern matches
\s*\b(and|or|if)\b\s* - 0+ whitespaces, word boundary, Group 1: and, or, if, word boundary and then 0+ whitespaces
| - or
(\"[^\"]*\") - Group 2: ", any 0+ chars other than " and then a "
| - or
(\s+) - Group 3: 1+ whitespaces.
If Group 3 matches, the whitespace is removed; if Group 2 matches, the quoted text is put back unchanged; and if Group 1 matches, the keyword is put back wrapped in single spaces. The whole result is then .trim()ed.

How do I replace a certain char in between 2 strings using regex

I'm new to regex and have been trying to work this out on my own but I don't seem to get it working. I have an input that contains start and end flags and I want to replace a certain char, but only if it's between the flags.
So, for example, if the start flag is START, the end flag is END, and the char I'm trying to replace is ", I would be replacing it with \"
I would call input.replaceAll(regex, "\\\\\"");
I tried writing a regex that matches only the right " chars, but so far I have only been able to match all chars between the flags, not just the " chars: (?<=START)(.*)(?=END)
Example input:
This " is START an " example input END string ""
START This is a "" second example END
This" is "a START third example END " "
Expected output:
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "
Find all characters between START and END, and for those characters replace " with \".
To achieve this, apply a replacer function to all matches of characters between START and END:
string = Pattern.compile("(?<=START).*?(?=END)").matcher(string)
.replaceAll(mr -> mr.group().replace("\"", "\\\\\""));
which produces your expected output.
Some notes on how this works.
The first step is to match all characters between START and END, using lookarounds with a reluctant quantifier:
(?<=START).*?(?=END)
The ? after the .* changes the match from greedy (as many chars as possible while still matching) to reluctant (as few chars as possible while still matching). This prevents the middle quote in the following input from being altered:
START a"b END c"d START e"f END
A greedy quantifier will match from the first START all the way past the next END to the last END, incorrectly including c"d.
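The next step is to replace " with \" within each match. The full match is group 0, available as MatchResult#group(). We don't need regex for this replacement; a plain String#replace is enough (and yes, replace() replaces all occurrences). The text returned by the replacer function is itself interpreted as a replacement string, in which a backslash is an escape character, so the backslash has to be escaped once more; that is why the Java literal is "\\\\\"".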
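To see the greedy/reluctant difference on the input above, here is a small sketch (Java 9+). The class and method names are illustrative, and Matcher.quoteReplacement is used in place of the hand-doubled backslash:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Escapes every " inside each match of the given regex.
    static String escapeQuotes(String regex, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(mr ->
            // quoteReplacement protects the new backslash from replacement-string processing
            Matcher.quoteReplacement(mr.group().replace("\"", "\\\"")));
    }

    public static void main(String[] args) {
        String input = "START a\"b END c\"d START e\"f END";
        System.out.println(escapeQuotes("(?<=START).*?(?=END)", input)); // reluctant
        System.out.println(escapeQuotes("(?<=START).*(?=END)", input));  // greedy
    }
}
```

With the reluctant pattern only the quotes in a"b and e"f are escaped; with the greedy one the quote in c"d is escaped as well.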
For now I've been able to solve it by creating capture groups and repeatedly replacing the match until there are no more matches left. I even had to insert a placeholder identifier, because replacing with \" would keep the " char there and create an infinite loop. Once there are no more matches left, I replace the placeholder and get the expected result.
I still feel like there has to be a way cleaner way to do this using only 1 replace statement...
Code that worked for me:
class Playground {
public static void main(String[ ] args) {
String input = "\"ThSTARTis is a\" te\"\"stEND \" !!!";
String regex = "(.*START.+)\"+(.*END+.*)";
while(input.matches(regex)){
input = input.replaceAll(regex, "$1---replace---$2");
}
String result = input.replace("---replace---", "\\\"");
System.out.println(result);
}
}
Output:
"ThSTARTis is a\" te\"\"stEND " !!!
I would love any suggestions as to how I could solve this in a better/cleaner way.
Another option is to make use of the \G anchor with 2 capture groups. In the replacement use the 2 capture groups followed by \"
(?:(START)(?=.*END)|\G(?!^))((?:(?!START|END)(?>\\+\"|[^\r\n\"]))*)\"
Explanation
(?: Non capture group
(START)(?=.*END) Capture group 1, match START and assert there is END to the right
| Or
\G(?!^) Assert the current position at the end of the previous match
) Close non capture group
( Capture group 2
(?: Non capture group
(?!START|END) Negative lookahead, assert not START or END directly to the right
(?>\\+\"|[^\r\n\"]) Match 1+ times \ followed by " or match any char except " or a newline
)* Close the non capture group and optionally repeat it
) Close group 2
\" Match "
See a Java regex demo and a Java demo
For example:
String regex = "(?:(START)(?=.*END)|\\G(?!^))((?:(?!START|END)(?>\\\\+\\\"|[^\\r\\n\\\"]))*)\\\"";
String string = "This \" is START an \" example input END string \"\"\n"
+ "START This is a \"\" second example END\n"
+ "This\" is \"a START third example END \" \"";
String subst = "$1$2\\\\\"";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
String result = matcher.replaceAll(subst);
System.out.println(result);
Output
This " is START an \" example input END string ""
START This is a \"\" second example END
This" is "a START third example END " "

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers with the same issue. It's probably because I can't phrase it properly that I haven't found anything to solve my issue yet... I promise to fix the title, or just close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the spaces in between are some \t first, and some whitespaces then).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and whitespaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them is to trim() your string to get rid of any trailing whitespace and then split on whitespace (which should now only occur in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It captures two groups: the first gets the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345).
It breaks down like this: (^v\w.+) requires the string to start with a v followed by a word character, and then .+ greedily takes any characters except a newline, giving back only as much as the rest of the pattern needs in order to match. Because .+ also matches spaces and tabs, group 1 can end with trailing whitespace, so trim() it (or use \S+ instead of .+). \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads the build id, ensuring that it conforms to your specified format. Note that this keeps the dashes, so you can split the string afterwards to get the parts you need. If you want, you could even give each part its own capture group.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
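For illustration, a hedged sketch of how this pattern could be used from Java. The class and method names are invented here, and the trim() is an addition to cope with the whitespace that the greedy .+ can leave at the end of group 1:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VersionBuildGroups {
    private static final Pattern P =
        Pattern.compile("(^v\\w.+)\\s+(\\d+-\\d+-\\w+-\\d+)\\s*");

    /** Returns {version, buildId}, or null when the input does not match. */
    static String[] parse(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Greedy .+ can leave trailing tabs/spaces in group 1, hence trim().
        return new String[] { m.group(1).trim(), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("v3.1.build.dev.12345.team \n12345-12345-cici-12345 ");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```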
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] means: match any character that [\\s] does not match, i.e. any non-whitespace character. .+ worked well in your case, but all it really says is "match one or more of any character", even whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, their order from left to right gives group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace; it cannot begin with whitespace because group 2 has already consumed all of it.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match exactly one whitespace character, followed by one or more characters that are not a newline. When the first line ends in several spaces, the first match is therefore just those spaces (which become "" once you strip the whitespace), and the build id only shows up in a second match that starts at the \n.
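A quick way to compare [\\s].+ and [\\s]+.+ is to list what each one finds. The sketch below assumes a sample with three spaces before the newline (the question says trailing spaces exist but does not show how many); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BuildIdMatches {
    // Collects every match of regex in input, with whitespace stripped as in the question's code.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().replaceAll("\\s", ""));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed input: three spaces before the newline.
        String sample = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345 ";
        System.out.println(matches("[\\s].+", sample));  // empty string first, then the build id
        System.out.println(matches("[\\s]+.+", sample)); // the build id only
    }
}
```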
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the regex engine interprets the . character: by default . does NOT match newlines, so it stops matching just before the \n. To change that you need the flag Pattern.DOTALL, passed via Pattern.compile(String regex, int flags).
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

REGEX to format phone number in java

Given a phone number with spaces and + allowed, how would you write a regular expression to format it so that non-digits and extra spaces are removed?
I have this so far
String num = " Ken's Phone is + 123 2213 123 (night time)";
System.out.println(num.replaceAll("[^\\d|+|\\s]", "").replaceAll("\\s\\s+", " ").replaceAll("\\+ ", "\\+").trim());
Would you simplify it so that the same result is obtained?
Thank you
I would put trim() first, or at least before you collapse the runs of multiple spaces.
Also keep in mind that \s means any whitespace: [ \t\n\x0B\f\r]. If you only mean the space character, use ' ' instead.
A nicer way to express that you only want at least two spaces to be replaced would be
replaceAll("\\s{2,}", " ")
First extract the number-with-spaces part, then compress multiple spaces to single spaces, and finally remove all spaces that follow a plus sign:
String numberWithSpaces = str.replaceAll("^[^\\d+]*([+\\d\\s]+)[^\\d]*$", "$1").replaceAll("\\s+", " ").replaceAll("\\+\\s*", "+");
I tested this code and it works.
You can simplify it as:
num.replaceAll("[^\\d+\\s]", "") // [^\\d|+|\\s] => [^\\d+\\s]
.replaceAll("\\s{2,}", " ") // \\s\\s+ => \\s{2,}
.replaceAll("\\+\\s", "+") // \\+ => +
.trim()
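Wrapped in a method, the simplified chain can be checked against the sample from the question. This is only a sketch; the class and method names are not from the original posts:

```java
final class PhoneFormat {
    static String format(String raw) {
        return raw.replaceAll("[^\\d+\\s]", "")  // keep only digits, '+' and whitespace
                  .replaceAll("\\s{2,}", " ")    // collapse runs of whitespace to one space
                  .replaceAll("\\+\\s", "+")     // glue '+' to the number that follows
                  .trim();
    }

    public static void main(String[] args) {
        System.out.println(format(" Ken's Phone is + 123 2213 123 (night time)"));
    }
}
```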

Regex to find words with letters and numbers separated or not by symbols

I need to build a regex that match words with these patterns:
Letters and numbers:
A35, 35A, B503X, 1ABC5
Letters and numbers separated by "-", "/", "\":
AB-10, 10-AB, A10-BA, BA-A10, etc...
I wrote this regex for it:
\b[A-Za-z]+(?=[(?<!\-|\\|\/)\d]+)[(?<!\-|\\|\/)\w]+\b|\b[0-9]+(?=[(?<!\-|\\|\/)A-Za-z]+)[(?<!\-|\\|\/)\w]+\b
It works partially, but it's match only letters or only numbers separated by symbols.
Example:
10-10, open-office, etc.
And I don't wanna this matches.
I guess that my regex is very repetitive and somewhat ugly.
But it's what I have for now.
Could anyone help me?
I'm using java/groovy.
Thanks in advance.
Interesting challenge. Here is a java program with a regex that picks out the types of "words" you are after:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "A35, 35A, B503X, 1ABC5 " +
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +
"10-10, open-office, etc.";
Pattern regex = Pattern.compile(
"# Match special word having one letter and one digit (min).\n" +
"\\b # Match first word having\n" +
"(?=[-/\\\\A-Za-z]*[0-9]) # at least one number and\n" +
"(?=[-/\\\\0-9]*[A-Za-z]) # at least one letter.\n" +
"[A-Za-z0-9]+ # Match first part of word.\n" +
"(?: # Optional extra word parts\n" +
" [-/\\\\] # separated by -, / or backslash\n" +
" [A-Za-z0-9]+ # Match extra word part.\n" +
")* # Zero or more extra word parts.\n" +
"\\b # Start and end on a word boundary",
Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(s);
while (regexMatcher.find()) {
System.out.print(regexMatcher.group() + ", ");
}
}
}
Here is the correct output:
A35, 35A, B503X, 1ABC5, AB-10, 10-AB, A10-BA, BA-A10,
Note that the only complex regexes which are "ugly", are those that are not properly formatted and commented!
Just use this:
([a-zA-Z]+[-\/\\]?[0-9]+|[0-9]+[-\/\\]?[a-zA-Z]+)
In a Java string literal the backslashes must be doubled (the / needs no escaping at all):
"([a-zA-Z]+[-/\\\\]?[0-9]+|[0-9]+[-/\\\\]?[a-zA-Z]+)"
Excuse me for writing my solution in Python; I don't know enough Java to write it in Java.
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## This part verifies that
'[^ ]*' ## there are at least one
'(?(1)\d|[A-Z]))' ## letter and one digit.
'('
'(?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])' # start of second group
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,)' # end of second group
')',
re.IGNORECASE) # this group 2 catches the string
My solution catches the desired string in the second group: ((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])[A-Z0-9-/\\\\]*[A-Z0-9](?= |\Z|,))
The part before it verifies that at least one letter and at least one digit are present in the captured string:
(?(1)\d|[A-Z]) is a conditional regex that means "if group(1) captured something, then there must be a digit here, otherwise there must be a letter"
Group(1) is ([A-Z]) in (?=(?:([A-Z])|[0-9])
(?:([A-Z])|[0-9]) is a non-capturing group that matches a letter (captured) OR a digit, so when it matches a letter, group(1) isn't empty
The flag re.IGNORECASE makes the pattern match both upper-case and lower-case letters.
In the second group, I am obliged to write (?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9]) because variable-length lookbehind assertions are not allowed. This part means: one character that can't be '-', '/' or '\', preceded by a blank, a comma, or the start of the string.
Conversely, (?= |\Z|,) means "followed by a blank, the end of the string, or a comma".
This regex assumes that the characters '-', '/', '\' can't be the first or the last character of a captured string. Is that right?
import re
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## (from here) This part verifies that
'[^ ]*' # there are at least one
'(?(1)\d|[A-Z]))' ## (to here) letter and one digit.
'((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,))',
re.IGNORECASE) # this group 2 catches the string
ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR gt4/ui bn\\3K"
print([mat.group(2) for mat in pat.finditer(ch)])
s = "A35, 35A, B503X,1ABC5 " +\
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +\
"10-10, open-office, etc."
print([mat.group(2) for mat in pat.finditer(s)])
result
['ALPHA13', '10B', 'a10', 'UCS5000', 'DV-3000', '300-BR', 'gt4/ui', 'bn\\3K']
['A35', '35A', 'B503X', '1ABC5', 'AB-10', '10-AB', 'A10-BA', 'BA-A10']
My first pass yields
(^|\s)(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)(\s|$)
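Sorry, but it's not Java-formatted (you'll need to double the backslashes in \s etc.). Also, you can't use \b here, because a word boundary exists wherever a word character (letter, digit or underscore) meets a non-word character, so a token like AB-10 has boundaries in the middle; I used \s and the start and end of the string instead.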
This is still a bit raw
EDIT
Version 2, slightly better, but could be improved for performance by using possessive quantifiers. It matches ABC76 AB-32 3434-F etc, but not ABC or 19\23 etc.
((?<=^)|(?<=\s))(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)((?=$)|(?=\s))
A condition of the form (A OR NOT A) can be omitted, so the symbols can safely be ignored.
for (String word : "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" "))
if (word.matches ("(.*[A-Za-z].*[0-9])|(.*[0-9].*[A-Za-z].*)"))
// do something
You didn't mention -x4, 4x-, 4-x-, -4-x or -4-x-, I expect them all to match.
My expression just looks for something-alpha-something-digits-something, where something might be alpha, digits or symbols, and the opposite: something-digits-something-alpha-something. If something else might occur, like !#$~()[]{} and so on, it would get longer.
Tested with scala:
scala> for (word <- "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")
| if word.matches ("(.*[A-Za-z].*[0-9])|(.*[0-9].*[A-Za-z].*)")) yield word
res89: Array[java.lang.String] = Array(10B, A10, UCS5000, DV-3000, 300-BR)
Slightly modified to filter matches:
String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
Pattern pattern = java.util.regex.Pattern.compile ("\\b([^ ,]*[A-Za-z][^ ,]*[0-9])[^ ,]*|([^ ,]*[0-9][^ ,]*[A-Za-z][^ ,]*)\\b");
Matcher matcher = pattern.matcher (s);
while (matcher.find ()) { System.out.print (matcher.group () + "|"); }
But I still have an error, which I don't find:
A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x|-4-x|
4x should be 4x-, and -4-x should be -4-x-.
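The missing trailing '-' comes from the final \b: there is no word boundary between '-' and a following space, so the engine backtracks to the last position that does have one (right after the x). One possible fix, sketched here under the assumption that tokens are delimited only by spaces and commas, is to replace \b with lookarounds on the delimiters and to check for a letter and a digit with lookaheads. The class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MixedTokens {
    private static final Pattern TOKEN = Pattern.compile(
        "(?<![^ ,])"            // token starts at the beginning or after a space/comma
        + "(?=[^ ,]*[A-Za-z])"  // the token contains a letter
        + "(?=[^ ,]*[0-9])"     // ... and a digit
        + "[^ ,]+");            // take the whole token, symbols included

    static List<String> find(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("A35, etc. -4x, 4x- -4-x- 10-10, oe-oe"));
    }
}
```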

How to group in regex

I have this input string (an OID): 1.2.3.4.5.66.77.88.99.10.52
I want to group the numbers in threes, like this:
Group 1 : 1.2.3
Group 2 : 4.5.66
Group 3 : 77.88.99
Group 4 : 10.52
It should be dynamic, depending on the input. If it has 30 numbers, it should return 10 groups.
I have tested using this regex : (\d+.\d+.\d+)
But the result is this
Match 1: 1.2.3
Subgroups:
1: 1.2.3
Match 2: 4.5.66
Subgroups:
1: 4.5.66
Match 3: 77.88.99
Subgroups:
1: 77.88.99
So one more match (10.52) is still missing.
Can anyone help me with the regex? Thank you
\d+(?:\.\d+){0,2}
This is basically the same as Al's final regex - ((?:\d+\.){0,2}\d+) - but I think it's clearer this way. And there's no need to put parentheses around the whole regex. Assuming you're using Matcher.find() to get the matches, you can use group() or group(0) instead of group(1) to retrieve the matched text.
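A short sketch of that approach in Java, using Matcher.find() and group() as described. The class and method names are only for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OidGroups {
    // One number, then up to two more ".number" parts.
    private static final Pattern TRIPLE = Pattern.compile("\\d+(?:\\.\\d+){0,2}");

    static List<String> group(String oid) {
        List<String> groups = new ArrayList<>();
        Matcher m = TRIPLE.matcher(oid);
        while (m.find()) {
            groups.add(m.group()); // group() is the whole match, no extra parentheses needed
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(group("1.2.3.4.5.66.77.88.99.10.52"));
    }
}
```

The dot between two groups is simply skipped, because a match cannot start on it.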
If you want to match up to three numbers, you should try:
((?:\d+\.?){1,3})
The {1,3} part matches 1-3 of the preceding item (which is one or more digits optionally followed by a literal .). Note that the dot is escaped so that it doesn't match any character.
Edit
Further explanation: The (?: ) part is a grouping that cannot be used for backreferences (tends to be faster), see section 4.3 here for more information. You could, of course, also just use ((\d+\.?){1,3}) if you prefer. For more information on {1,3}, see here under "Limiting Repetition".
Edit (2)
Fixed error pointed out by dtmunir. An alternative way that is a bit more explicit (and doesn't catch the extra "." at the end of the early groups) is:
((?:\d+\.){0,2}\d+)
Al, that will not capture the 52. But this one in fact will:
((?:\d+\.?){1,3})
The only change is adding the question mark after the .
This allows it to accept the last number without having a period after it
Explanation (EDIT):
The \d+ as you can imagine captures consecutive digits.
The \. captures a period
The \.? captures a period, but allows the inner group to not require a period at the end
The (?:\d+\.?) defines "one group" which in your case you want to be 3 numbers.
The {1,3} sets the limits. It requires a minimum of 1 inner group and at most 3 inner groups. These groups may or may not end with a period.
This is my weird code to do this without regex :-)
public static String[] getTokens(String s) {
String[] splitted = s.split("\\.");
//Personally I hate Double.valueOf but I don't know how to avoid it
String[] result = new String[Double.valueOf(Math.ceil(Double.valueOf(splitted.length) / 3)).intValue()];
for (int i = 0, j = 0; j < splitted.length; i++, j+=3) {
//Weird concat
result[i] = splitted[j] + ( j+1 < splitted.length ? "." + splitted[j+1] : "" ) + ( j+2 < splitted.length ? "." + splitted[j+2] : "" );
}
return result;
}
