I'm trying to get matches for commands like this:
[AUTR| <version_software> | <version_protocol> | <msg> ]
[PING]
What regular expression would find these matches for the first command?
AUTR
version_software
version_protocol
msg
This is the code that currently parses it:
String[] tokens = msg.replace('<',' ').replace('>',' ').replace('[', ' ').replace(']', ' ').split("\\|");
for (int i=0; i<tokens.length; i++) tokens[i] = tokens[i].trim();
I'm only wondering how it can be done with a regex solution.
EDIT:
I'm trying to match groups with simpler expressions. With this code, the call to m.groupCount() returns one, but when I try to print the group it throws the exception "java.lang.IllegalStateException: No match found":
Pattern pattern = Pattern.compile("([\\w+])");
Matcher m = pattern.matcher("[AUTR]");
for (int i=0; i<m.groupCount();i++)
{
System.out.println(m.group(i));
}
EDIT:
http://fiddle.re/6ykc
Regular Expression:
\[([\w]+)(\s*\|\s*<([\w. ]+)>\s*)*\]
Java Regex String:
"\\[([\\w]+)(\\s*\\|\\s*<([\\w. ]+)>\\s*)*\\]"
Note that this now handles commands with a variable number of parameters, and that every extra parameter must match the character set [a-zA-Z_0-9. ] (which includes periods and spaces).
Issue: with variable-length commands, a group that sits inside a repeated group cannot capture more than one value; only its last repetition is kept. From the Pattern documentation:
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
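To make the quoted behaviour concrete, here is a minimal sketch that runs the pattern above against a sample command. The class name, the helper method and the sample input are made up for illustration; only the regex comes from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // The pattern from this answer, written as a Java string literal.
    static final Pattern COMMAND =
            Pattern.compile("\\[([\\w]+)(\\s*\\|\\s*<([\\w. ]+)>\\s*)*\\]");

    // Hypothetical helper: returns what group 3 holds after a full match,
    // or null if the input does not match or carries no parameters.
    static String lastParameter(String input) {
        Matcher m = COMMAND.matcher(input);
        // matches() must succeed before group() may be called.
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Three parameters are present, but group 3 only keeps the last one.
        System.out.println(lastParameter("[AUTR| <1.0> | <2> | <hello world> ]"));
        // No parameters: the repeated group never ran, so group 3 is null.
        System.out.println(lastParameter("[PING]"));
    }
}
```

So a single pattern can validate the whole command, but it cannot hand back every parameter through one repeated group.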
EDIT 2:
In order to get all of them you can use two regular expressions. The first one grabs the command:
String command_regex = "\\[([\\w]+)";
Find that first, and then find the parameters, using the <> as the key characters to select on:
String parameters = "<([\\w. ]+)>";
Pattern pattern = Pattern.compile(parameters);
Matcher matcher = pattern.matcher(string_to_match);
while (matcher.find()) {
System.out.println(matcher.group());
}
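Put together, the two-step approach might look like the sketch below. The class and method names and the sample input are assumptions for illustration; the two regexes are the ones given above. Note that find() is called before every group() call, which is what the code in the question's first EDIT was missing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommandParser {
    static final Pattern COMMAND = Pattern.compile("\\[([\\w]+)");
    static final Pattern PARAMETER = Pattern.compile("<([\\w. ]+)>");

    // Hypothetical helper: returns the command name followed by its
    // parameters, or an empty list if no command is found.
    static List<String> parse(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher command = COMMAND.matcher(input);
        if (!command.find()) {
            return tokens;
        }
        tokens.add(command.group(1));
        // Each successful find() exposes one <...> parameter in group 1.
        Matcher parameter = PARAMETER.matcher(input);
        while (parameter.find()) {
            tokens.add(parameter.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parse("[AUTR| <1.0> | <2> | <hello world> ]"));
        System.out.println(parse("[PING]"));
    }
}
```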
Hope that helps.
ORIGINAL:
I'm not exactly sure about the formatting: are the "<", ">" and "|" required? And what are the formats for the command, version_software, version_protocol and message? Here is my attempt at a regular expression, though (tested in Python):
\[(\w+)\s*\|\s*<([\w.]+)>\s*\|\s*<(\w+)>\s*\|\s*<([\w\s]+)>\s*\]
You need to make sure to escape the brackets and the pipe symbols. (I added \s* conditions in between because I don't know whether there will be spaces or not.) If you do:
>> search = re.search(r"expression above", line)
>> search.groups()
It should give all tokens, in Python at least. I left it more hardcoded to leave room for adjustments on each token you want to grab; otherwise you could reduce the last 3 parts by making them a group and saying to repeat it 3 times. Let me know the results.
Why do I get two matches when using the regular expression .* on the string abcd 1234 abcd? See https://regex101.com/r/rV8jfz/1.
From the explanation given by regex101, I can see that the second match happened at position 14-14 and the value matched is null. But why is a second match done? Is there a way that I can avoid the second match?
I understand .* means zero or more of any character, so it's trying to find zero occurrences. But I don't understand why this null match is required.
The problem is that in a programming language (e.g. Java), when I do while(matcher.find()) { ... }, the loop runs twice, whereas I want it to run only once.
I know this may not be a real-world matching situation, but I see it as a good case to study in order to understand and explore regex.
Edit - following @terdon's response.
I would like to keep the /g option in regex101; I am aware of it. I want to know the total number of possible matches.
https://regex101.com/r/EvOoAr/1 -> pattern abcd against string abcd 1234 abcd gives two matches, and I want to know this information.
The problem I find is when dealing with this in a language like Java:
Ref - https://onecompiler.com/java/3xnax494k
String str = "abcd 1234 abcd";
Pattern p = Pattern.compile(".*");
Matcher matcher = p.matcher(str);
int matchCount=0;
while(matcher.find()) {
matchCount++;
System.out.println("match number: " + matchCount);
System.out.println("matcher.groupCount(): " + matcher.groupCount());
System.out.println("matcher.group(): " + matcher.group());
}
The output is -
match number: 1
matcher.groupCount(): 0 //you can ignore this
matcher.group(): abcd 1234 abcd
match number: 2
matcher.groupCount(): 0
matcher.group(): //this is my concern. The program has to deal with this empty match somehow.
It would be nice for me as a programmer if find() did not match against "nothing". As it stands, I have to add extra code in the loop to catch this "nothing" case.
This empty-match problem (in code) gets even worse with this regex case - https://regex101.com/r/5HuJ0R/1 -> [0-9]* against abcd 1234 abcd gives 12 matches.
The reason you get two matches is that you are using the g (global) flag. If you remove it from your regex101 example, you will only get one match.
This happens because the global flag makes the regex engine try to find as many matches in the string as possible. Since .* matches zero or more characters, it can also match nothing, i.e. the empty string. The first match is therefore the entire string; the engine then tries again at the end of the string, where .* matches the empty string that comes after it. Removing the g makes the engine stop at the first match, the entire string, and not try to find others.
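Java has no g flag; a while (matcher.find()) loop plays that role, so the empty match shows up there too. One possible way to deal with it, sketched here with made-up class and method names, is to skip zero-length matches inside the loop. Where you control the pattern, using + instead of * avoids the empty match in the first place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonEmptyMatches {
    // Hypothetical helper: collects every match of regex in input,
    // ignoring zero-length matches.
    static List<String> nonEmptyMatches(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "abcd 1234 abcd";
        System.out.println(nonEmptyMatches(".*", str));     // only the full string remains
        System.out.println(nonEmptyMatches("[0-9]*", str)); // only the digit run remains
        System.out.println(nonEmptyMatches(".+", str));     // + never matches the empty string
    }
}
```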
Here is my code:
String a = "X^5+2X^2+3X^3+4X^4";
String exp[]=a.split("(|\\+\\d)[xX]\\^");
for(int i=0;i<exp.length;i++) {
System.out.println("exp: "+exp[i]+" ");
}
I'm trying to get the output 5, 2, 3, 4,
but instead I got this:
exp:
exp:5
exp:2
exp:3
exp:4
I don't know where the blank first line comes from, and I cannot find a way to get rid of it. I tried other regexes and also tried using compile, but I still can't get rid of the first line. When I try a new string, "X+X^5+2X^2+3X^3+4X^4", the first line shows exp:X.
I also tried my problem in an online regex tester, and its answer is 5, 2, 3, 4, but Eclipse gives a blank and then 5, 2, 3, 4. I need help figuring this out.
Try using a regex with a Matcher instead of split, e.g.:
String input = "X^5+2X^2+3X^3+4X^4";
Pattern pattern = Pattern.compile("\\^([0-9]+)");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("exp: " + matcher.group(1));
}
It gives output:
exp: 5
exp: 2
exp: 3
exp: 4
How it works:
Pattern used: \^([0-9]+)
This matches any string starting with ^ followed by one or more digits (note the + sign). The caret (^) is prefixed with a backslash (\) because it has a special meaning in regular expressions (the beginning of a string), but in your case you just want an exact match of a ^ character.
We want to wrap our matches in groups so that we can refer to them later during the matching process. That means we need to mark them using parentheses ( and ).
Then we want to put our pattern into a Java String. In a String literal, the \ character has a special meaning: it introduces an escape sequence, e.g. "\n" represents a new line. This means that if we put our pattern into a String literal, we need to escape the \, so our pattern becomes: "\\^([0-9]+)". Note the double \.
Next we iterate through all matches, getting group 1, which is our number match. Note that the ^ character is not included in group 1 even though it is part of our pattern. That is because we used parentheses to mark the group we are looking for, which in our case contains only the digits.
Because you are using the split method, which looks for occurrences of the regex and, well, splits the string at those positions. Your string starts with X^, which matches your regex, so the first element of the result is the (empty) text that comes before that match.
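If you would rather keep split, one option is to drop the empty pieces afterwards. This is only a sketch with made-up class and method names; the regex is the one from the question. String.split includes an empty leading substring whenever there is a positive-width match at the very beginning of the input, which is exactly the case here.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitExponents {
    // Hypothetical helper: splits on the question's regex and discards
    // the empty piece that precedes a match at the start of the input.
    static List<String> exponents(String polynomial) {
        List<String> result = new ArrayList<>();
        for (String part : polynomial.split("(|\\+\\d)[xX]\\^")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String a = "X^5+2X^2+3X^3+4X^4";
        // The raw split has five elements; the first one is empty.
        System.out.println(a.split("(|\\+\\d)[xX]\\^").length);
        System.out.println(exponents(a));
    }
}
```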
I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said the number of commas is your primary matching criterion. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match at least 1 character that is anything but a comma, followed by a comma. And perform the aforementioned match 6 times consecutively. Note: your data must end with a trailing comma as is consistent with your sample data.
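A quick way to try the six-comma pattern in Java is sketched below. The class and method names are made up, and the rejected inputs are invented test cases; the accepted line is the sample from the question.

```java
import java.util.regex.Pattern;

public class SixFieldCheck {
    // Six repetitions of "one or more non-comma characters, then a comma",
    // anchored at both ends of the input.
    static final Pattern SIX_FIELDS = Pattern.compile("^([^,]+,){6}$");

    // Hypothetical helper: true if the line consists of exactly six
    // non-empty, comma-terminated fields.
    static boolean isValid(String line) {
        return SIX_FIELDS.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F,"));
        System.out.println(isValid("a,b,c,"));          // too few fields
        System.out.println(isValid("a,,c,d,e,f,"));     // an empty field
        System.out.println(isValid("a,b,c,d,e,f,g,"));  // too many fields
    }
}
```

Because the pattern is anchored, find() behaves like matches() here; the anchors matter only if you use find().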
Well, your regular expression is certainly garbled: there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape commas. Let's first describe our requirements. You seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says match one or more non-commas, followed by a comma - [^,]+, - six times - {6}. The (?:...) notation is a non-capturing group, which lets us say "match the whole sub-expression six times"; without it, the {6} would only apply to the preceding character.
Alternatively, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
"([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if(m.matches()) {
for (int i = 1; i <= m.groupCount(); i++) {
System.out.println(m.group(i));
}
}
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F
Please could someone explain this for me:
We have a regular expression which we use to check if a string matches a specific sequence. The regular expression is shown below:
JPRN(JAPICCTI\d{6})|(JAPICCTI\d{6})
I want to try and understand what this code is trying to achieve:
Pattern matcher = Pattern.compile("JPRN(JAPICCTI\\d{6})|(JAPICCTI\\d{6})");
Matcher m = matcher.matcher("JAPICCTI132323");
if(m.find()){
Matcher m2 = matcher.matcher(m.group());
if(m2.find()){
return m2.replaceAll("$1");
}
}
The string it tries to check (i.e. JAPICCTI132323) does match with the regular expression.
However, I don't understand why the matching is done twice, i.e. once using the string and again using the "group". What would be the reason for doing this?
Also, what is the purpose of the $1 string?
This is failing because m2.replaceAll("$1") returns an empty string, but I was expecting it to return JAPICCTI132323. Given that I don't understand what it is doing, I am struggling to understand why the result is an empty string.
Thanks in advance.
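The | symbol indicates alternation, which means "try to match the left alternative first; if it does not match, try the right one".
$1 refers to whatever the first capture group matched ($0 would be the whole match). Here the input has no JPRN prefix, so the second alternative matches and group 1 does not take part in the match; in Java a group that did not match contributes nothing to the replacement, which is why you get an empty string.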
If you have a number of capture groups: (one\d+)(two\w+\d)(three.*?)
Then you could use $1, $2 and $3 to represent the matched strings.
You can also name a capture group, like so: (?<firstMatch>regexpattern) or (?<phoneNumber>\d{2}\s\d{4}). Java supports this from Java 7 onwards (group names may contain only letters and digits), and the replacement string refers to a named group as ${firstMatch}.
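You might have to do some testing, but you should be able to specify $1$2 as the replacement: if one of the groups did not match, it adds nothing, while the other one supplies the text.
If both groups could match at once this would cause issues, because you would get two strings in your replacement, but with this pattern the groups sit in different alternatives, so only one of them can match at a time.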
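Here is a small sketch of that behaviour, using the pattern from the question. The class and method names and the JPRN-prefixed sample are made up for illustration, and the last pattern (an optional non-capturing prefix) is my own suggestion rather than anything from the original code.

```java
import java.util.regex.Pattern;

public class ReplaceGroupsDemo {
    // The pattern from the question, with \d escaped for a Java string literal.
    static final Pattern ID =
            Pattern.compile("JPRN(JAPICCTI\\d{6})|(JAPICCTI\\d{6})");

    // Suggested alternative: one capture group behind an optional prefix.
    static final Pattern ID_SINGLE_GROUP =
            Pattern.compile("(?:JPRN)?(JAPICCTI\\d{6})");

    // Hypothetical helper wrapping Matcher.replaceAll.
    static String replace(Pattern pattern, String replacement, String input) {
        return pattern.matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Group 1 did not take part in the match, so the result is empty.
        System.out.println("[" + replace(ID, "$1", "JAPICCTI132323") + "]");
        // $1$2: whichever group matched supplies the text.
        System.out.println(replace(ID, "$1$2", "JAPICCTI132323"));
        System.out.println(replace(ID, "$1$2", "JPRNJAPICCTI132323"));
        // With a single group, $1 works for both input forms.
        System.out.println(replace(ID_SINGLE_GROUP, "$1", "JAPICCTI132323"));
        System.out.println(replace(ID_SINGLE_GROUP, "$1", "JPRNJAPICCTI132323"));
    }
}
```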
I am trying to solve a simple Java regex matching problem but still getting conflicting results (following up on this and that question).
More specifically, I am trying to match a repetitive text input, consisting of groups that are delimited by '|' (vertical bar) that may be directly preceded by underscore ('_'), especially if the groups are not empty (i.e., if no two consecutive | delimiters appear in the input).
An example such input is:
Text group 1_|Text group 2_|||Text group 5_|||Text group 8
In addition, I need a way to verify that a match has occurred, in order to avoid applying the processing related to that input to other, totally different inputs that my application also processes, using different regular expressions.
To confirm that a regex works, I am using RegexPal.
After several tests, the closest to what I want are the following two Regular Expressions, suggested in the questions I quoted above:
1. (?:\||^)([^\\|]*)
2. \G([^\|]+?)_?\||\G()\||\G([^\|]*)$
Using either of these, if I run a matcher.find() loop I get:
All the text groups, with the underscore included in the end, from Regex 1
All the text groups apart from the last, with no underscore but 2 empty groups in the end, from Regex 2.
So, apparently Regex 2 is not correct (and RegexPal also does not show it as matching).
I could use Regex 1 and do some post-processing to remove the trailing underscore, although ideally I would like the regex to do that for me.
However, neither of the two aforementioned regular expressions returns true for matcher.matches(), whereas matcher.find() is always true even for totally irrelevant input (reasonable, since there will often be at least 1 matching group, even in other text).
I thus have two questions:
Is there a correct (fully working) regex that excludes the trailing underscore?
Is there any way of checking that only the correct regex has matched?
The code used to test Regex 1, is something like
String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
Matcher matcher = Pattern.compile("(?:\\||^)([^\\\\|]*)").matcher(input);
if (matcher.matches())
{
System.out.println("Input MATCHED: " + input);
while (matcher.find())
{
System.out.println("\t\t" + matcher.group(1));
}
}
else
{
System.out.println("\tInput NOT MATCHED: " + input);
}
Using the above code always results in "NOT MATCHED". Removing the if/else and only using matcher.find() does retrieve all text groups.
The Matcher#matches method attempts to match the entire input sequence against the pattern, which is why you are getting the result Input NOT MATCHED. See the documentation here: http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Matcher.html#matches
If you want to exclude the trailing underscore, you can use this regex (a slight modification of what you already have):
(?:\\||^)([^\\\\|_]*)
This works only if you are sure that _ appears just before | and nowhere else; an underscore inside a group's text would cut the capture short at that point.
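A find() loop with that regex could look like the following sketch. The class and method names are made up; the pattern and the input are the ones from this thread. Empty groups come back as empty strings, so positions are preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipeGroups {
    // The modified regex from this answer, as a Java string literal.
    static final Pattern GROUP = Pattern.compile("(?:\\||^)([^\\\\|_]*)");

    // Hypothetical helper: returns every text group without its trailing underscore.
    static List<String> extract(String input) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(input);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
        List<String> groups = extract(input);
        for (int i = 0; i < groups.size(); i++) {
            System.out.println((i + 1) + ": [" + groups.get(i) + "]");
        }
    }
}
```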
RegexPal is a JavaScript regex tool. The Java and JavaScript regular expression languages differ. Consider using a Java Regex tool; perhaps this one
This may be close to what you want: (?:([^_\|]+)_{0,1}+\|*)+
Edit: Code added.
In Java 6 this prints each group (the find() loop).
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it is only here so the snippet compiles on its own.
public class GroupTest
{
    public static void main(String[] args)
    {
        String input = "Text group 1_|Text group 2_|||Text group 5_|||Text group 8";
        // Validates the whole input: one or more text groups.
        Pattern pattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)+");
        // Matches a single text group; used to iterate with find().
        Pattern groupPattern = Pattern.compile("(?:([^_\\|]+)_{0,1}+\\|*)");
        Matcher matcher = pattern.matcher(input);

        if (matcher.matches())
        {
            System.out.println("matcher.matches() is true");

            // The repeated group keeps only its last capture.
            int groupCount = matcher.groupCount();
            for (int index = 1; index <= groupCount; ++index)
            {
                System.out.print("group (pattern)[");
                System.out.print(index);
                System.out.print("]: ");
                System.out.println(matcher.group(index));
            }

            // Each find() yields one text group: first the full match, then group 1.
            Matcher groupMatcher = groupPattern.matcher(input);
            while (groupMatcher.find())
            {
                System.out.print("group (groupPattern):");
                System.out.println(groupMatcher.group());
                System.out.println(groupMatcher.group(1));
            }
        }
        else
        {
            System.out.println("No match");
        }
    }
}