Can't retrieve data from matched * group in Java - java

I'm having trouble figuring out the proper regex.
Here is some sample code:
#Test
public void testFindEasyNaked() {
System.out.println("Naked_find");
String arg = "hi mom <us-patent-grant seq=\"002\" image=\"D000001\" >foo<name>Fred</name></us-patent-grant> extra stuff";
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
System.out.println(nakedPat);
Pattern naked = Pattern.compile(nakedPat, Pattern.MULTILINE + Pattern.DOTALL );
Matcher m = naked.matcher(arg);
if (m.find()) {
System.out.println("found naked");
for (int i = 0; i <= m.groupCount(); i++) {
System.out.printf("%d: %s\n", i, m.group(i));
}
} else {
System.out.println("can't find naked either");
}
System.out.flush();
}
My regex matches the string, but I am not able to pull the repeated pattern.
What I want is to have
seq=\"002\" image=\"D000001\"
pulled out as a group. Here is what the program shows when I execute it.
Naked_find
<(us-patent-grant)((\s*[\S&&[^>]])*)*\s*>(.+?)</\1>
found naked
0: <us-patent-grant seq="002" image="D000001" >foo<name>Fred</name></us-patent-grant>
1: us-patent-grant
2:
3: "
4: foo<name>Fred</name>
The group #4 is fine, but where is the data for #2 and #3, and why is there a double quote in #3?
Thanks
Pat

Even if using an XML parser would be sound, I think I can explain the error in your regular expression:
String nakedPat = "<(us-patent-grant)((\\s*[\\S&&[^>]])*)*\\s*>(.+?)</\\1>";
You try to match the parameters in the part ((\\s*[\\S&&[^>]])*)*. Look at your innermost group: you have \s* ("one or more space") followed by \\S&&[^>] ("one non-space which is not >). It means that in your group, you will either have from zero to some spaces followed by a single non-space character.
So this will match any non-space character between "us-patent-grant" and >. And every time the regular expression engine will match it, it will assign the value to the group 3. It means the group previously matched are lost. That's why you have the last character of the tag, that is ".
You can improve it a bit by adding a + after [\\S&&[^>]], so it will match at least a complete sequence of non-spaces, but you would only obtain the last tag attribute in your group. You should instead use a better and simpler way:
Your goal being to pull out seq="002" image="D000001" in a group, what you should do is simply to match the sequence of every characters which are not > after "us-patent-grant":
"<(us-patent-grant)\\s*([^>]*)\\s*>(.+?)</\\1>"
This way, you have the following values in your groups:
Group 1: us-patent-grant
Group 2: seq=\"002\" image=\"D000001\"
Group 3: foo<name>Fred</name>
Here is the test on Regexplanet: http://fiddle.re/ezfd6

Related

Regex - Words containing multiple underscores

I'm making a Lexer, and have chosen to use Regex to split my tokens.
I'm working on all different tokens, except the one that really bugs me is words and identifiers.
You see, the rules I have in place are the following:
Words cannot start with or end with an underscore.
Words can be one or more characters in length.
Underscores can only be used between letters, and can appear multiple times.
Example of what I want:
_foo <- Invalid.
foo_ <- Invalid.
_foo_ <- Invalid.
foo_foo <- Valid.
foo_foo_foo <- Valid.
foo_foo_ <- Partially Valid. Only "foo_foo" should be picked up.
_foo_foo <- Partially Valid. Only "foo_foo" should be picked up.
I'm getting close, as this is what I currently have:
([a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+)
Except, it only detects the first occurence of an underscore. I want all of them.
Personal Request:
I would rather the answer be contained inside of a single group, as I have structured my tokeniser around them, except I would be more than happy to change my design if you can think of a better way of handling it. This is what I currently use:
private void tokenise(String regex, String[] data) {
Set<String> tokens = new LinkedHashSet<String>();
Pattern pattern = Pattern.compile(regex);
// First pass. Uses regular expressions to split data and catalog token types.
for (String line : data) {
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
for (int i = 1; i < matcher.groupCount() + 1; i++) {
if (matcher.group(i) != null) {
switch(i) {
case (1):
// Example group.
// Normally I would structure like:
// 0: Identifiers
// 1: Strings
// 2-?: So on so forth.
tokens.add("FOO:" + matcher.group());
break;
}
}
}
}
}
}
Try ([a-zA-Z]+(?:_[a-zA-Z]+)*)
The first part of the pattern, [a-zA-Z]+, matches one or more letters.
The second part of the pattern, (?:_[a-zA-Z]+), matches an undescore if it is followed by one or more letters.
The * at the end means the second part can be repeated zero or more times.
The (?: ) is like plain (), but doesn't return the matched group.

Java, Regex, Nested optional groups

I'm trying to capture nested optional groups in Java but it's not working out.
I'm trying to capture a keyword followed by an interval, where a keyword is anything for now, and an interval is just two dates. The interval may be optional, and the two dates may be optional as well. So, the following are valid matches.
word
word [01/01/1900, ]
word [, 01/01/2000]
word [01/01/1900, 01/01/2000]
I want to capture the keyword and both the dates even if they are null.
This is the Java MWE I've came up with.
public class Parser {
public static void main(String[] args) {
Parser parser = new Parser();
String s = "word [01/01/1900, 01/01/2000]";
parser.parse(s);
}
public void parse(String s) {
String date = "\\d{2}/\\d{2}/\\d{4}";
String interval = "\\[("+date+")?, ("+date+")?\\]";
String keyword = "(.+)( "+interval+")?";
Pattern p = Pattern.compile(keyword);
Matcher m = p.matcher(s);
if (m.matches()) {
for (int i = 0; i <= m.groupCount(); ++i) {
System.out.println(i + ": " + m.group(i));
}
}
}
}
And this is the output
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
4: null
If interval isn't optional, then it works.
String keyword = "(.+)( "+interval+")";
0: word [01/01/1900, 01/01/2000]
1: word
2: [01/01/1900, 01/01/2000]
3: 01/01/1900
4: 01/01/2000
If interval is a non-matching group (but still optional), then it doesn't work.
String keyword = "(.+)(?: "+interval+")?";
0: word [01/01/1900, 01/01/2000]
1: word [01/01/1900, 01/01/2000]
2: null
3: null
What do I need to do to retrieve back both dates? Thank You.
Edit: Part 2.
Suppose now I watch to match repeated keywords. i.e. the regex, keyword(, keyword)*. I tried this out, but only the first and the last instance is captured.
For simplicity, suppose I want to match the following a, b, c, d with the regex ([a-z])(?:, ([a-z]))*
However, I can only retrieve back the first and last group.
0: a, b, c, d
1: a
2: d
Why is this so?
Just found out that this cannot be done. Capture group multiple times
Change the first part of keyword from (.+) to (.+?).
Without the ?, the (.+) is a greedy quantifier. That means it will try to match as much as it can. I don't know all the mechanics of how the regex engine works, but I believe that in your case, what it's doing is setting some counter N to the number of characters remaining in the source. If it can use up that many characters and get the whole regex to match, it will. Otherwise, it tries N-1, N-2, etc., until the entire regex matches. I also think it goes from left to right when trying this; that is, since (.+) is the leftmost "part" of the pattern (for some definition of "part"), it loops on that part before it tries any looping on parts that are to the right. Thus, it's more important to make (.+) greedy than to make any other part of the pattern greedy; the (.+) takes precedence.
In your case, since (.+) is followed by an optional part, the regex matcher starts by trying the entire remainder of the string--and it succeeds, because the rest of the string, which is empty, is a fine match for an optional substring. That should also explains why it doesn't work if your substring isn't optional--the empty substring no longer matches.
Adding ? makes it a "reluctant" (or "stingy") quantifier, which works in the opposite direction. It starts by seeing if it can make a match with 0 characters, then 1, 2, ..., instead of starting with N and going downward. So when it gets up to 5, matching "word ", and it finds that the rest of the string matches your optional part, it completes and gives the results you were expecting.

Regular expression for a string starting with some string

I have some string, that has this type: (notice)Any_other_string (notes that : () has in this string`.
So, I want to separate this string to 2 part : (notice) and the rest. I do as follow :
private static final Pattern p1 = Pattern.compile("(^\\(notice\\))([a-z_A-Z1-9])+");
String content = "(notice)Stack Over_Flow 123";
Matcher m = p1.matcher(content);
System.out.println("Printing");
if (m.find()) {
System.out.println(m.group(0));
System.out.println(m.group(1));
}
I hope the result will be (notice) and Stack Over_Flow 123, but instead, the result is : (notice)Stack and (notice)
I cannot explain this result. Which regex is suitable for my purpose?
Issue 1: group(0) will always return the entire match - this is specified in the javadoc - and the actual capturing groups start from index 1. Simply replace it with the following:
System.out.println(m.group(1));
System.out.println(m.group(2));
Issue 2: You do not take spaces and other characters, such as underscores, into account (not even the digit 0). I suggest using the dot, ., for matching unknown characters. Or include \\s (whitespace) and _ into your regex. Either of the following regexes should work:
(^\\(notice\\))(.+)
(^\\(notice\\))([A-Za-z0-9_\\s]+)
Note that you need the + inside the capturing group, or it will only find the last character of the second part.

ReGex patten to match <c:if > conditional variable names?

I need to get conditional variable name for all cases in a particular jsp
I am reading the jsp line by line and searching for particular pattern like for a line say its checking two type of cond where it finds the match
<c:if condition="Event ='Confirmation'">
<c:if condition="Event1 = 'Confirmation' or Event2 = 'Action'or Event3 = 'Check'" .....>
Desired Result is name of all cond variable - Event,Event1,Event2,Event3 I have written a parser that only satisfying the first case But not able to find variable names for second case.Need a pattern to satisfy both of them.
String stringSearch = "<c:if";
while ((line = bf.readLine()) != null) {
// Increment the count and find the index of the word
lineCount++;
int indexfound = line.indexOf(stringSearch);
if (indexfound > -1) {
Pattern pattern = Pattern
.compile(test=\"([\\!\\(]*)(.*?)([\\=\\)\\s\\.\\>\\[\\(]+?));
Matcher matcher = pattern.matcher(line);
if (matcher.find()) {
str = matcher.group(1);
hset.add(str);
counter++;
}
}
If I understood your requirement well, this may work :
("|\s+)!?(\w+?)\s*=\s*'.*?'
$2 will give each condition variable name.
What it does is:
("|\s+) a " or one or more spaces
!? an optional !
(\w+?) one or more word character (letter, digit or underscore) (([A-Za-z]\w*) would be more correct)
\s*=\s* an = preceded and followed by zero or more spaces
'.*?' zero or more characters inside ' and '
Second capture group is (\w+?) retrieving the variable name
Add required escaping for \
Edit: For the additional conditions you specified, the following may suffice:
("|or\s+|and\s+)!?(\w+?)(\[\d+\]|\..*?)?\s*(!?=|>=?|<=?)\s*.*?
("|or\s+|and\s+) A " or an or followed by one or more spaces or an and followed by one or more spaces. (Here, it is assumed that each expression part or variable name is preceded by a " or an or followed by one or more spaces or an and followed by one or more spaces)
!?(\w+?) An optional ! followed by one or more word character
(\[\d+\]|\..*?)? An optional part constituting a number enclosed in square brackets or a dot followed by zero or more characters
(!?=|>=?|<=?) Any of the following relational operators : =,!=,>,<,>=,<=
$2 will give the variable name.
Here second capture group is (\w+?) retrieving variable name and third capture group retrieves any suffix if present (eg:[2] in Event[2]).
For input containing a condition Event.indexOf(2)=something, $2 gives Event only. If you want it to be Event.indexOf(2) use $2$3.
This could suit your needs:
"(\\w+)\\s*=\\s*(?!\")"
Which means:
Every word followed by a = that isn't followed by a "
For example:
String s = "<c:if condition=\"Event ='Confirmation'\"><c:if condition=\"Event1 = 'Confirmation' or Event2 = 'Action'or Event3 = 'Check'\" .....>";
Pattern p = Pattern.compile("(\\w+)\\s*=\\s*(?!\")");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group(1));
}
Prints:
Event
Event1
Event2
Event3

Regex for numeric portion of Java string

I'm trying to write a Java method that will take a string as a parameter and return another string if it matches a pattern, and null otherwise. The pattern:
Starts with a number (1+ digits); then followed by
A colon (":"); then followed by
A single whitespace (" "); then followed by
Any Java string of 1+ characters
Hence, some valid string thats match this pattern:
50: hello
1: d
10938484: 394958558
And some strings that do not match this pattern:
korfed49
: e4949
6
6:
6:sdjjd4
The general skeleton of the method is this:
public String extractNumber(String toMatch) {
// If toMatch matches the pattern, extract the first number
// (everything prior to the colon).
// Else, return null.
}
Here's my best attempt so far, but I know I'm wrong:
public String extractNumber(String toMatch) {
// If toMatch matches the pattern, extract the first number
// (everything prior to the colon).
String regex = "???";
if(toMatch.matches(regex))
return toMatch.substring(0, toMatch.indexOf(":"));
// Else, return null.
return null;
}
Thanks in advance.
Your description is spot on, now it just needs to be translated to a regex:
^ # Starts
\d+ # with a number (1+ digits); then followed by
: # A colon (":"); then followed by
# A single whitespace (" "); then followed by
\w+ # Any word character, one one more times
$ # (followed by the end of input)
Giving, in a Java string:
"^\\d+: \\w+$"
You also want to capture the numbers: put parentheses around \d+, use a Matcher, and capture group 1 if there is a match:
private static final Pattern PATTERN = Pattern.compile("^(\\d+): \\w+$");
// ...
public String extractNumber(String toMatch) {
Matcher m = PATTERN.matcher(toMatch);
return m.find() ? m.group(1) : null;
}
Note: in Java, \w only matches ASCII characters and digits (this is not the case for .NET languages for instance) and it will also match an underscore. If you don't want the underscore, you can use (Java specific syntax):
[\w&&[^_]]
instead of \w for the last part of the regex, giving:
"^(\\d+): [\\w&&[^_]]+$"
Try using the following: \d+: \w+

Categories