I have this input string(oid) : 1.2.3.4.5.66.77.88.99.10.52
I want group each number into 3 to like this
Group 1 : 1.2.3
Group 2 : 4.5.66
Group 3 : 77.88.99
Group 4 : 10.52
It should be very dynamic depending on the input. If it has 30 numbers meaning it will return 10 groups.
I have tested using this regex : (\d+.\d+.\d+)
But the result is this
Match 1: 1.2.3
Subgroups:
1: 1.2.3
Match 2: 4.5.66
Subgroups:
1: 4.5.66
Match 3: 77.88.99
Subgroups:
1: 77.88.99
Where as still missed one more matches.
Can anyone help me to provide the Regex. Thank you
\d+(?:\.\d+){0,2}
This is basically the same as Al's final regex - ((?:\d+\.){0,2}\d+) - but I think it's clearer this way. And there's no need to put parentheses around the whole regex. Assuming you're using Matcher.find() to get the matches, you can use group() or group(0) instead of group(1) to retrieve the matched text.
If you want to match up to three digits, you should try:
((?:\d+\.?){1,3})
The {1,3} part matches 1-3 of the preceding item (which is one or more digits followed by a literal .. Note that the dot is escaped so that it doesn't match any character.
Edit
Further explanation: The (?: ) part is a grouping that cannot be used for backreferences (tends to be faster), see section 4.3 here for more information. You could, of course, also just use ((\d+\.?){1,3}) if you prefer. For more information on {1,3}, see here under "Limiting Repetition".
Edit (2)
Fixed error pointed out by dtmunir. An alternative way that is a bit more explicit (and doesn't catch the extra "." at the end of the early groups) is:
((?:\d+\.){0,2}\d+)
Al that will not capture the 52. But this one in fact will:
((?:\d+\.?){1,3})
The only change is adding the question mark after the .
This allows it to accept the last number without having a period after it
Explanation (EDIT):
The \d+ as you can imagine captures consecutive digits.
The \. captures a period
The \.? captures a period, but allows the inner group to not require a period at the end
The (?:\d+\.?) defines "one group" which in your case you want to be 3 numbers.
The {1,3} sets the limits. It requires a minimum of 1 inner group and at most 3 inner groups. These groups may or may not end with a period.
This is my weird code for do this without regex :-)
public static String[] getTokens(String s) {
String[] splitted = s.split("\\.");
//Personally I hate Double.valueOf but I don't know how to avoid it
String[] result = new String[Double.valueOf(Math.ceil(Double.valueOf(splitted.length) / 3)).intValue()];
for (int i = 0, j = 0; j < splitted.length; i++, j+=3) {
//Weird concat
result[i] = splitted[j] + ( j+1 < splitted.length ? "." + splitted[j+1] : "" ) + ( j+2 < splitted.length ? "." + splitted[j+2] : "" );
}
return result;
}
Related
and thank you for your help,
I am trying to get a regex expression to decode a string with either a comma or semi-colon as anchor but I can't seem to get it to work for comma's or both. Please tell me what I'm missing or doing wrong. thanks!
^(?<FADECID>\d{6})?(?<MSG>([a-z A-Z 0-9 ()-:]*[;,]{1}+){8,}+)?(?<ANCH>\w*[;,])?(?<TIME>\d{4})?(?<FM>\d{2})?[;,]?(?<CON>.*)$.*
inbound type strings to decode - I need to treat the comma and or semicolon the same.
383154VSC X1;;;;;;;BOTH WASTE DRAIN VLV NOT CLSD (135MG;35MG);HARD;093502
282151FCMC1 1;;;;;;;FUEL MAIN PUMP1 (121QA1);HARD;093502
732112EEC2B 1;;;;;;;FMU(E2-4071KS)WRG:EEC J12 TO FMV LVDT POS,HARD;
383154VSC X1,,,,,,,BOTH WASTE DRAIN VLV NOT CLSD (135MG,35MG),HARD,093502
282151FCMC1 1,,,,,,,FUEL MAIN PUMP1 (121QA1);HARD;093502
732112EEC2B 1,,,,,,,FMU(E2-4071KS)WRG:EEC J12 TO FMV LVDT POS,HARD,
383154VSC X1,,,,,,,BOTH WASTE DRAIN VLV NOT CLSD (135MG;35MG);HARD;093502
282151FCMC1 1;;;;;;;FUEL MAIN PUMP1 (121QA1),HARD,093502
732112EEC2B 1,,,,,,,FMU(E2-4071KS)WRG:EEC J12 TO FMV LVDT POS;HARD;
This string has the possibility to contain mulitple text [;,] separated messages.
ABC;DEF;;HIJ;NNN;JJJ;XXX;EEX;HARD;
This manages that - (?([a-z A-Z 0-9 ()-:]*[;,]{1}+){8,}+)?
but it doesn't observe commas?
This works for ; but not for comma or both, my problem is that it can be both a semi-colon or a comma?
if I make the regex only comma, it works for comma strings, I know i'm missing a quantifier or something like.
if ( null != MORE && ! MORE.isEmpty() ) {
while ( null != MORE && ! MORE.isEmpty() || MORE.trim().equals("EOR")) {
LOG.info("MORE CONTINUE: " + MORE);
if ( MORE.trim().equals("EOR") ) {
break;
}
String patternMoreString = "^(?<FADECID>\\d{6})?(?<MSG>([a-z A-Z 0-9 ()-:()]*[;,]{1}+){8,}+)+?(?<ANCH>\\w*[;,])?(?<TIME>\\d{4})?(?<FM>\\d{2})?[;,]?(?<CON>.*)$.*";
Pattern patternMore = Pattern.compile(patternMoreString, Pattern.DOTALL);
Matcher matcherMore = patternMore.matcher(MORE);
while ( matcherMore.find() ) {
MORE = matcherMore.group("CON");
summary.setReportId("FLR");
summary.setAreg(Areg);
summary.setShip(Ship);
summary.setOrig(Orig);
summary.setDest(Dest);
summary.setTimestamp(Ts);
summary.setAta(matcherMore.group("FADECID"));
summary.setTime(matcherMore.group("TIME"));
summary.setFm(matcherMore.group("FM"));
summary.setMsg(matcherMore.group("MSG"));
serviceRecords.add(summary);
LOG.info("*** A330 MPF MORE Record ***");
LOG.info(summary.getReportId());
LOG.info(summary.getAreg());
LOG.info(summary.getShip());
LOG.info(summary.getOrig());
LOG.info(summary.getDest());
LOG.info(summary.getTimestamp());
LOG.info(summary.getAta());
LOG.info(summary.getTime());
LOG.info(summary.getFm());
LOG.info(summary.getMsg());
summary = new A330PostFlightReportRecord();
}
}
}
}
//---
I need for all cases group 2 and if TIME and FM exists.
You could make use of a capturing group and a backreference using the number of that group to get consistent delimiters.
In this case the capturing group is ([;,]) which is the fourth group denoted by \4 matching either ; or ,
If you only need group 2 and if TIME and FM you can omit group ANCH
^(?<FADECID>\d{6})(?<MSG>([a-zA-Z0-9() -]*([;,])){7,})(?<TIME>\d{4})?(?<FM>\d{2})?\4?(?<CON>.*)$
Explanation
^ Start of string
(?<FADECID>\d{6}) Named group FADECID, match 6 digits
(?<MSG> Named group MSG
( Capture group 3
[a-zA-Z0-9() -]* Match 0+ times any of the lister
([;,]) Capture group 4, used as backreference to get consistent delimiters
){7,} Close group and repeat 7+ times
) Close group MSG
(?<TIME>\d{4})? Optional named group TIME, match 4 digits
(?<FM>\d{2})? Optional named group FM, match 2 digits
\4? Optional backreference to capture group 4
(?<CON>.*) Named group CON Match any char except a newline 0+ times
$ End of string
Regex demo
Note that group 3 the capture group itself is repeated, giving you the last value of the iteration, which will be HARD
I want to:
remove all whitespaces unless it's right before or after (0-1 space before and 0-1 after) the predefined keywords (for example: and, or, if then we leave the spaces in " and " or " and" or "and " unchanged)
ignore everything between quotes
I've tried many patterns. The closest I've come up with is pretty close, but it still removes the space after keywords, which I'm trying to avoid.
regex:
\s(?!and|or|if)(?=(?:[^"]*"[^"]*")*[^"]*$)
Test String:
if (ans(this) >= ans({1,2}) and (cond({3,4}) or ans(this) <= ans({5,6})), 7, 8) and {111} > {222} or ans(this) = "hello my friend and or " and(cond({1,2}) $1 123
Ideal result:
if (ans(this)>=ans({1,2}) and (cond({3,4}) or ans(this)<=ans({5,6})),7,8) and {111}>{222} or ans(this)="hello my friend and or " and(cond({1,2})$1123
I then can use str = str.replaceAll in java to remove those whitespaces. I don't mind doing multiple steps to get to the result, but I am not familiar with regex so kinda stuck.
any help would be appreciated!
Note: I edited the result. Sorry about that. For the space around keywords: shrunk to 1 if there are spaces. Either leave it or add 1 space if it's 0 (I just don't want "or ans" becomes "orans", but "and(cond" becomes "and (cond)" is fine (shrink to 1 space before and 1 space after if exists). Ignore everything between quotes.
You make an intelligent use of capturing groups. The general idea here would be
match_this|or_this|or_even_this|(but_capture_this)
In terms of a regular expression this could be
(?:(?:\s+(?:and|or|if)\s+)|"[^"]+")|(\s+)
You'd then need to replace the match only if the first capturing group is not empty.
See a demo on regex101.com (with (*SKIP*)(*FAIL) which serves the same purpose).
You may use
String example = " if (ans(this) >= ans({1,2}) and (cond({3,4}) or ans(this) <= ans({5,6})), 7, 8) and {111} > {222} or ans(this) = \"hello my friend and or \" and(cond({1,2}) $1 123 ";
String rx = "\\s*\\b(and|or|if)\\b\\s*|(\"[^\"]*\")|(\\s+)";
Matcher m = Pattern.compile(rx).matcher(example);
example = m.replaceAll(r -> r.group(3) != null ? "" : r.group(2) != null ? r.group(2) : " " + r.group(1) + " ").trim();
System.out.println( example );
See the Java demo.
The pattern matches
\s*\b(and|or|if)\b\s* - 0+ whitespaces, word boundary, Group 1: and, or, if, word boundary and then 0+ whitespaces
| - or
(\"[^\"]*\") - Group 2: ", any 0+ chars other than " and then a "
| - or
(\s+) - Group 3: 1+ whitespaces.
If Group 3 matches, they are removed, if Group 2 matches, it is put back into the result and if Group 1 matches, it is wrapped with spaces and pasted back. The whole result is .trim()ed.
No question on SO addresses my particular problem. I know very little about regular expression. I am building an expression parser in Java using Regex Class for that purpose. I want to extract Operands, Arguments, Operators, Symbols and Function Names from expression and then save to ArrayList. Currently I am using this logic
String string = "2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)" //This is just for testing purpose later on it will be provided by user
List<String> res = new ArrayList<>();
Pattern pattern = Pattern.compile((\\Q^\\E|\\Q/\\E|\\Q-\\E|\\Q-\\E|\\Q+\\E|\\Q*\\E|\\Q)\\E|\\Q)\\E|\\Q(\\E|\\Q(\\E|\\Q%\\E|\\Q!\\E)) //This string was build in a function where operator names were provided. Its mean that user can add custom operators and custom functions
Matcher m = pattern.matcher(string);
int pos = 0;
while (m.find())
{
if (pos != m.start())
{
res.add(string.substring(pos, m.start()))
}
res.add(m.group())
pos = m.end();
}
if (pos != string.length())
{
addToTokens(res, string.substring(pos));
}
for(String s : res)
{
System.out.println(s);
}
Output:
2
!
+
atan2
(
3
+
9
,
2
+
3
)
-
2
*
PI
+
3
/
3
-
9
-
12
%
3
*
sin
(
9
-
9
)
+
(
2
+
6
/
2
)
Problem is that now Expression can contain Matrix with user defined format. I want to treat every Matrix as a Operand or Argument in case of functions.
Input 1:
String input_1 = "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6"
Output Should be:
2
+
3
-
9
*
[{2+3,2,6},{7,2+3,2+3i}]
+
9
*
6
Input 2:
String input_2 = "{[2,5][9/8,func(2+3)]}+9*8/5"
Output Should be:
{[2,5][9/8,func(2+3)]}
+
9
*
8
/
5
Input 3:
String input_3 = "<[2,9,2.36][2,3,2!]>*<[2,3,9][23+9*8/8,2,3]>"
Output Should be:
<[2,9,2.36][2,3,2!]>
*
<[2,3,9][23+9*8/8,2,3]>
I want that now ArrayList should contain every Operand, Operators, Arguments, Functions and symbols at each index. How can I achieve my desired output using Regular expression. Expression validation is not required.
I think you can try with something like:
(?<matrix>(?:\[[^\]]+\])|(?:<[^>]+>)|(?:\{[^\}]+\}))|(?<function>\w+(?=\())|(\d+[eE][-+]\d+)|(?<operand>\w+)|(?<operator>[-+\/*%])|(?<symbol>.)
DEMO
elements are captured in named capturing groups. If you don't need it, you can use short:
\[[^\]]+\]|<[^>]+>|\{[^\}]+\}|\d+[eE][-+]\d+|\w+(?=\()|\w+|[-+\/*%]|.
The \[[^\]]+\]|<[^>]+>|\{[^\}]+\} match opening bracket ({, [ or <), non clasing bracket characters, and closing bracket (},],>) so if there are no nested same-type brackets, there is no problem.
Implementatin in Java:
public class Test {
public static void main(String[] args) {
String[] expressions = {"2!+atan2(3+9,2+3)-2*PI+3/3-9-12%3*sin(9-9)+(2+6/2)", "2+3-9*[{2+3,2,6},{7,2+3,2+3i}]+9*6",
"{[2,5][9/8,func(2+3)]}+9*8/5","<[2,9,2.36][2,3,2!]>*<[2,3,9][23 + 9 * 8 / 8, 2, 3]>"};
Pattern pattern = Pattern.compile("(?<matrix>(?:\\[[^]]+])|(?:<[^>]+>)|(?:\\{[^}]+}))|(?<function>\\w+(?=\\())|(?<operand>\\w+)|(?<operator>[-+/*%])|(?<symbol>.)");
for(String expression : expressions) {
List<String> elements = new ArrayList<String>();
Matcher matcher = pattern.matcher(expression);
while (matcher.find()) {
elements.add(matcher.group());
}
for (String element : elements) {
System.out.println(element);
}
System.out.println("\n\n\n");
}
}
}
Explanation of alternatives:
\[[^\]]+\]|<[^>]+>|\{[^\}]+\} - match opening bracket of given
type, character which are not closing bracket of that type
(everything byt not closing bracket), and closing bracket of that
type,
\d+[eE][-+]\d+ = digit, followed by e or E, followed by operator +
or -, followed by digits, to capture elements like 2e+3
\w+(?=\() - match one or more word characters (A-Za-z0-9_) if it is
followed by ( for matching functions like sin,
\w+ - match one or more word characters (A-Za-z0-9_) for matching
operands,
[-+\/*%] - match one character from character class, to match
operators
. - match any other character, to match other symbols
Order of alternatives is quite important, as last alternative . will match any character, so it need to be last option. Similar case with \w+(?=\() and \w+, the second one will match everything like previous one, however if you don't wont to distinguish between functions and operands, the \w+ will be enough for all of them.
In longer exemple the part (?<name> ... ) in every alternative, is a named capturing group, and you can see in demo, how it group matched fragments in gorups like: operand, operator, function, etc.
With regular expressions you cannot match any level of nested balanced parentheses.
For example, in your second example {[2,5][9/8,func(2+3)]} you need to match the opening brace with the close brace, but you need to keep track of how many opening and closing inner braces/parens/etc there are. That cannot be done with regular expressions.
If, on the other hand, you simplify your problem to remove any requirement for balancing, then you probably can handle with regular expressions.
I'm having trouble figuring out the proper regex.
Here is some sample code:
#Test
public void testFindEasyNaked() {
System.out.println("Naked_find");
String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
System.out.println(nakedPat);
Pattern naked = Pattern.compile(nakedPat, Pattern.MULTILINE + Pattern.DOTALL );
Matcher m = naked.matcher(arg);
if (m.find()) {
System.out.println("found naked");
for (int i = 0; i <= m.groupCount(); i++) {
System.out.printf("%d: %s\n", i, m.group(i));
}
} else {
System.out.println("can't find naked either");
}
System.out.flush();
}
My regex matches the string, but I am not able to pull the repeated pattern.
What I want is to have
seq=\"002\" image=\"D000001\"
pulled out as a group. Here is what the program shows when I execute it.
Naked_find
<(us-patent-grant)((\s*[\S&&[^>]])*)*\s*>(.+?)</\1>
found naked
0: <us-patent-grant seq="002" image="D000001" >foo<name>Fred</name></us-patent-grant>
1: us-patent-grant
2:
3: "
4: foo<name>Fred</name>
The group #4 is fine, but where is the data for #2 and #3, and why is there a double quote in #3?
Thanks
Pat
Even if using an XML parser would be sound, I think I can explain the error in your regular expression:
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
You try to match the parameters in the part ((\\s*[\\S&&[^>]])*)*. Look at your innermost group: you have \s* ("one or more space") followed by \\S&&[^>] ("one non-space which is not >). It means that in your group, you will either have from zero to some spaces followed by a single non-space character.
So this will match any non-space character between "us-patent-grant" and >. And every time the regular expression engine will match it, it will assign the value to the group 3. It means the group previously matched are lost. That's why you have the last character of the tag, that is ".
You can improve it a bit by adding a + after [\\S&&[^>]], so it will match at least a complete sequence of non-spaces, but you would only obtain the last tag attribute in your group. You should instead use a better and simpler way:
Your goal being to pull out seq="002" image="D000001" in a group, what you should do is simply to match the sequence of every characters which are not > after "us-patent-grant":
"<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>"
This way, you have the following values in your groups:
Group 1: us-patent-grant
Group 2: seq=\"002\" image=\"D000001\"
Group 3: foo<name>Fred</name>
Here is the test on Regexplanet: http://fiddle.re/ezfd6
How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.
Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken (see this link).
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]
I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).
But that said...I suspect the simplest cross-language solution is just write it yourself without using regexes.
UPDATE:
^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.
It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!
Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:
use strict;
use warnings;
my $str = 'xxaaaayyybbbbbzzccccxx';
while ( $str =~ /((.)\2*)/g ) {
next unless length $1 == 4;
my $substr = $1;
print "$substr\n";
}
output
aaaa
cccc
Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.
sub runs_of_length {
my($n,$str) = #_;
my $n_minus_1 = $n - 1;
my $_run_pattern = qr/
(?:
# In the middle of the string, we have to force the
# run being matched to start on a new character.
# Otherwise, the regex engine will give a false positive
# by starting in the middle of a run.
(.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
#$1 $2 $3
# Don't forget about a potential run that starts at
# the front of the target string.
^(.) (\4{$n_minus_1}) (?!\4)
# $4 $5
)
/x;
my #runs;
while ($str =~ /$_run_pattern/g) {
push #runs, defined $4 ? "$4$5" : "$2$3";
}
#runs;
}
A few test cases:
my #tests = (
"xxaaaayyybbbbbzzccccxx",
"aaaayyybbbbbzzccccxx",
"xxaaaa",
"aaaa",
"",
);
$" = "][";
for (#tests) {
my #runs = runs_of_length 4, $_;
print qq<"$_":\n>,
" - [#runs]\n";
}
Output:
"xxaaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"xxaaaa":
- [aaaa]
"aaaa":
- [aaaa]
"":
- []
It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.
How about this in python?
def match(string, n):
parts = []
current = None
for c in string:
if not current:
current = c
else:
if c == current[-1]:
current += c
else:
parts.append(current)
current = c
result = []
for part in parts:
if len(part) == n:
result.append(part)
return result
Testing with your string with various sizes:
match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ["bbbbb"]
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ["yyy"]
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz']
Explanation:
The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the current length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.
Anyways, I think this'll do!
Why not leave to regexp engine what it does best - finding longest string of same symbols and then check length yourself?
In Perl:
my $str = 'xxaaaayyybbbbbzzccccxx';
while($str =~ /(.)\1{3,}/g){
if(($+[0] - $-[0]) == 4){ # insert here full match length counting specific to language
print (($1 x 4), "\n")
}
}
>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]
>>> z
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']
From there you can iterate through the list and check for occasions when N==4 very easily, like this:
>>> [item for item in z if len(item)==4]
['cccc', 'aaaa']
In Java we can do like below code
String test ="xxaaaayyybbbbbzzccccxx uuuuuutttttttt";
int trimLegth = 4; // length of the same characters
Pattern p = Pattern.compile("(\\w)\\1+",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE);
Matcher m = p.matcher(test);
while (m.find())
{
if(m.group().length()==trimLegth) {
System.out.println("Same Characters String " + m.group());
}
}