Simple java Regex doesn't - java

When I use this code, I don't get the expected result:
pattern = Pattern.compile("create\\stable\\s(\\w*)\\s\\(", Pattern.CASE_INSENSITIVE);
matcher = pattern.matcher("create table CONTACT (");
if(matcher.matches()) {
for(int i =0; i<matcher.groupCount();i++) {
System.out.println("table : " + matcher.group(i) + matcher.start(i) + " - " + matcher.end(i));
}
}
}
I expect to capture CONTACT, but the regex captures the whole expression "create table CONTACT (".
Does anyone have an idea what the problem is?
Thanks

The regex engine treats the entire match as a group of its own, group 0. The first capturing group in your regex is therefore at index 1.
If you skip group 0, you will find what you're looking for in group 1.
Your code never prints group 1 because groupCount() doesn't count group 0: it returns 1 here, so the loop condition i<matcher.groupCount() stops after i = 0.
Group zero denotes the entire pattern by convention. It is not included in this count.
You probably don't need a loop, and you can just extract the desired string directly with group(1).
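A minimal sketch of the point above, reusing the pattern and input from the question. The class name GroupZeroDemo and the helper tableName are made up for illustration. It reads group 1 directly and, for comparison, loops up to and including groupCount() so that group 0 and group 1 are both printed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupZeroDemo {
    // Same pattern as in the question.
    static final Pattern CREATE_TABLE =
            Pattern.compile("create\\stable\\s(\\w*)\\s\\(", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the table name, or null if the statement does not match.
    static String tableName(String sql) {
        Matcher matcher = CREATE_TABLE.matcher(sql);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        Matcher matcher = CREATE_TABLE.matcher("create table CONTACT (");
        if (matcher.matches()) {
            // groupCount() excludes group 0, so use <= to reach the last capturing group.
            for (int i = 0; i <= matcher.groupCount(); i++) {
                System.out.println("group " + i + ": " + matcher.group(i)
                        + " [" + matcher.start(i) + ", " + matcher.end(i) + ")");
            }
        }
        System.out.println("table: " + tableName("create table CONTACT ("));
    }
}
```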

Capturing groups are numbered from 1; group 0 is the entire match.
Following expression:
matcher.group(i)
should be replaced with:
matcher.group(i+1)
Or simply print group 1 if you want to print only one group:
System.out.println("table: " + matcher.group(1));

Related

Index of each matcher group of a pattern in Java

I am matching certain contents of a file against a regex and extracting groups from it. How can I get the start and end positions of each matched group?
I need the positions in order to replace those parts.
Any suggestions?
You're looking for methods m.start(int groupId) and m.end(int groupId)
Java Docs:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Matcher.html#start(int)
In this case I would consider using named capture groups, (?<groupName>YOUR_REGEX), together with the methods m.start("groupName") and m.end("groupName") (available since Java 8). Note that Java group names may contain only letters and digits and must start with a letter. This way, when you change your input text or add or remove groups, your group names stay the same. :)
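As a rough sketch of the named-group approach, assuming Java 8 or later: the group name table, the class NamedGroupSpan and the sample input are illustrative choices, not taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupSpan {
    // The group name "table" is an assumption made for this example.
    static final Pattern P = Pattern.compile("create\\s+table\\s+(?<table>\\w+)",
            Pattern.CASE_INSENSITIVE);

    // Returns {start, end} of the named group, or null when nothing matches.
    static int[] tableSpan(String text) {
        Matcher m = P.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start("table"), m.end("table") };
    }

    public static void main(String[] args) {
        String text = "create table CONTACT (";
        int[] span = tableSpan(text);
        // end is exclusive, so substring(start, end) yields the captured text.
        System.out.println(text.substring(span[0], span[1]) + " " + span[0] + " " + span[1]);
    }
}
```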
The following code prints the text matching the regular expression and the start and end position within the text:
String text = "a long text regex to match";
Matcher m = Pattern.compile("regex").matcher(text);
while (m.find()){
String found = m.group();
System.out.println(found + " " + m.start() + " " + m.end());
}
You can replace the desired content directly with the replaceAll method of String:
This method replaces each substring of this string that matches the given regular expression with the given replacement.
Then, you can use it like:
replaceAll("[0-9]", "X");
Hope it helps you!
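If only the captured part of each match should be replaced, one possible approach is to combine find() with start(1) and end(1). This is a sketch under assumptions: the helper name replaceGroup1, the pattern id=(\d+) and the sample input are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceGroupOnly {
    // Replaces only the text captured by group 1 in every match; the rest of each match is kept.
    static String replaceGroup1(Pattern pattern, String input, String replacement) {
        Matcher m = pattern.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            if (m.start(1) < 0) {
                continue; // group 1 did not take part in this match
            }
            out.append(input, last, m.start(1)); // text before the group, unchanged
            out.append(replacement);
            last = m.end(1);
        }
        out.append(input, last, input.length()); // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("id=(\\d+)");
        System.out.println(replaceGroup1(p, "id=12; id=345", "X"));
        // Whole-match replacement, as in the answer above.
        System.out.println("a1b22".replaceAll("[0-9]", "X"));
    }
}
```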

How to capture all nested matches?

I was trying to answer a question recently and while attempting to solve it, I ran into a question of my own.
Given the following code
private void regexample(){
String x = "a3ab4b5";
Pattern p = Pattern.compile("(\\D+(\\d+)\\D+){2}");
Matcher m = p.matcher(x);
while(m.find()){
for(int i=0;i<=m.groupCount();i++){
System.out.println("Group " + i + " = " + m.group(i));
}
}
}
And the output
Group 0 = a3ab4b
Group 1 = b4b
Group 2 = 4
Is there any straight-forward way I'm missing to get the value 3? The pattern should look for two occurrences of (\\D+(\\d+)\\D+) back-to-back, and a3a is part of the match. I realize I can change expression to (\\D+(\\d+)\\D+) and then look for all matches, but that isn't technically the same thing. Is the only way to do a double search? ie: Use the given pattern to match the string and then search again for each count of the outer group?
I guessed that the first values were overwritten with the second, but as I'm not that great with regex, I was hoping there was something I was missing.
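With a standard regex engine such as Java's, it is impossible to retrieve every occurrence captured by a repeated group: each repetition overwrites the previous capture, so only the last one is kept. You could spell out the repetition instead: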
Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");
Now, there are four groups instead of two, so you will get the values you expected.
This question deals with a similar problem.
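The "double search" mentioned in the question can be sketched as follows. This is one possible approach, not the only one: an outer pattern validates the two back-to-back units, and a second pattern then collects every digit run inside the matched text. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupCaptures {
    // Outer pattern: two back-to-back units, as in the question (groups made non-capturing).
    static final Pattern TWO_UNITS = Pattern.compile("(?:\\D+\\d+\\D+){2}");
    // Inner pattern: each digit run inside the outer match.
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the digit runs of the first outer match, or an empty list if there is none.
    static List<String> digitsOfFirstMatch(String input) {
        List<String> digits = new ArrayList<>();
        Matcher outer = TWO_UNITS.matcher(input);
        if (outer.find()) {
            Matcher inner = DIGITS.matcher(outer.group());
            while (inner.find()) {
                digits.add(inner.group());
            }
        }
        return digits;
    }

    public static void main(String[] args) {
        System.out.println(digitsOfFirstMatch("a3ab4b5"));
    }
}
```

The trailing 5 is not collected because it lies outside the text matched by the outer pattern.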

Extracting SQL data using regular expression

My data is malformed, as shown below. I need to extract the text before and after the dot symbol using a regular expression. I am using the code below, but I am not getting the exact data.
String rightHeading=null;
String leftHeading=null;
String formulaData="ifnull(\"Content Status\".\"Week Of Quarter\",0)";
Matcher matcher = Pattern.compile("(\"?([^()]*?)\"?)\\.(\"?([##$%><{}\\w ]*)\"?)").matcher(formulaData);
while (matcher.find())
{
String Column_Data=matcher.group(0);
String[] pieces = Column_Data.split("\\.");
rightHeading=pieces[0].replace("\"", "");
leftHeading=pieces[1].replace("\"", "");
System.out.println(rightHeading+ ": "+leftHeading);
}//while
Output which I got is:
ifnullContent Status.Week Of Quarter,0)
Expected output:
Content Status.Week Of Quarter
Below is my solution for your problem, along with the output that it produces.
String formulaData="(100*(FILTER(\"Fact - Bookings\".\"$ Total Gross Bookings\" USING (\"Booking Date\".\"Year\" = VALUEOF(\"CUR_YEAR\"))) - FILTER(Fact - Bookings.$ Total Gross BookingsData USING \"Booking Date\".\"Year\" = VALUEOF(\"PREV_YEAR\") AND \"Booking Date\".Sortable Number <= VALUEOF(\"PRV_YEAR_TD\") ) ) / FILTER(Fact - Bookings.$TotalGrossBookingsUsage \" USING \"Booking Date\".\"Year\" = VALUEOF(\"PREV_YEAR\") AND \"Booking Date\".\"Sortable Number\" <= VALUEOF(\"PRV_YEAR_TD\") ) )";
String p1 = "(\"(\\w*\\s*-*)*?\"\\.\".*?\")|((?:\\()((\\w*\\s*-*)*?\\.\\$\\w+))|(\"(\\w*\\s*-*)*?\"\\.(\\w+\\s+)+)";
Pattern p = Pattern.compile(p1);
Matcher m = p.matcher(formulaData);
while(m.find())
{
System.out.println(m.group(0).replaceAll("\"|\\(|\\)", ""));
}
Outputs:
Fact - Bookings.$ Total Gross Bookings
Booking Date.Year
Fact - Bookings.$ Total Gross BookingsData
Booking Date.Year
Booking Date.Sortable Number
Fact - Bookings.$TotalGrossBookingsUsage
Booking Date.Year
Booking Date.Sortable Number
As you can see, I didn't actually use a horrifically complex regex to solve your problem. This is because your input is far too varied to use this tool effectively.
The fact that your table.field pairs sometimes had $ or " symbols inside them made the data very inconsistent. Regular expressions find it hard to deal with this level of complexity, so I think my solution (in this example) is workable.
However, in future if you have any control over your data input, please try to sanitize it and make it as consistent as possible.
EDIT
Since that didn't work out for you, I've gone and changed my code snippet to use a regular expression.
Matcher matcher = Pattern.compile("([\\w[\\$##\\-^&]\\w\\[\\]' $]+)\\.([\\w\\[\\]' $]+)").matcher(formulaData);
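// lookingAt() always starts at the beginning of the region, so call it once; a while loop would never advance.
if (matcher.lookingAt()) {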
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end());
System.out.println(" Found: " + matcher.group());
}
lookingAt() is more suitable here as per the requirements and as mentioned in doc --
lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched.
If the match succeeds then more information can be obtained via the start, end, and group methods.
Hope this helps.
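To make the difference between matches(), lookingAt() and find() concrete, here is a small sketch run against the input from the question. The pattern is a deliberately simplified assumption that only handles pairs where both names are double-quoted; it does not cover the unquoted variants in the longer sample above. The class name MatchModes and the helper pairs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {
    // Simplified assumption: "table"."field" with both names in double quotes.
    static final Pattern PAIR = Pattern.compile("\"([^\"]+)\"\\.\"([^\"]+)\"");

    // Collects every table.field pair found anywhere in the formula.
    static List<String> pairs(String formula) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(formula);
        while (m.find()) {
            result.add(m.group(1) + "." + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String formula = "ifnull(\"Content Status\".\"Week Of Quarter\",0)";
        // A fresh matcher per call keeps the three checks independent of each other.
        System.out.println("matches:   " + PAIR.matcher(formula).matches());   // whole input must match
        System.out.println("lookingAt: " + PAIR.matcher(formula).lookingAt()); // must match at the start
        System.out.println("find:      " + PAIR.matcher(formula).find());      // may match anywhere
        System.out.println(pairs(formula));
    }
}
```

Because the input starts with ifnull(, only find() can reach the quoted pair in this example.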

Java regex to parse any number of Markdown-style links

I'm trying to parse a string for any occurrences of markdown style links, i.e. [text](link). I'm able to get the first of the links in a string, but if I have multiple links I can't access the rest. Here is what I've tried, you can run it on ideone:
Pattern p;
try {
p = Pattern.compile("[^\\[]*\\[(?<text>[^\\]]*)\\]\\((?<link>[^\\)]*)\\)(?:.*)");
} catch (PatternSyntaxException ex) {
System.out.println(ex);
throw(ex);
}
Matcher m1 = p.matcher("Hello");
Matcher m2 = p.matcher("Hello [world](ladies)");
Matcher m3 = p.matcher("Well, [this](that) has [two](too many) keys.");
System.out.println("m1 matches: " + m1.matches()); // false
System.out.println("m2 matches: " + m2.matches()); // true
System.out.println("m3 matches: " + m3.matches()); // true
System.out.println("m2 text: " + m2.group("text")); // world
System.out.println("m2 link: " + m2.group("link")); // ladies
System.out.println("m3 text: " + m3.group("text")); // this
System.out.println("m3 link: " + m3.group("link")); // that
System.out.println("m3 end: " + m3.end()); // 44 - I want 18
System.out.println("m3 count: " + m3.groupCount()); // 2 - I want 4
System.out.println("m3 find: " + m3.find()); // false - I want true
I know I can't have repeating groups, but I figured find would work, however it does not work as I expected it to. How can I modify my approach so that I can parse each link?
Can't you go through the matches one by one and do the next match from an index after the previous match? You can use this regex:
\[(?<text>[^\]]*)\]\((?<link>[^\)]*)\)
The find() method looks for the next part of the input that matches the pattern, so a match may be a substring of the entire string. Each call to find() returns the next match. matches(), by contrast, tries to match the entire string and fails if it cannot. Use something like this:
while (m.find()) {
String s = m.group(1);
// s now contains "BAR"
}
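Applied to the Markdown input from the question, the find() loop might look like the sketch below. It uses the shorter regex suggested above with the named groups text and link from the question; the class name MarkdownLinks and the helper links are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkdownLinks {
    // Matches a single [text](link) with no leading or trailing wildcard,
    // so each find() call can resume right after the previous link.
    static final Pattern LINK =
            Pattern.compile("\\[(?<text>[^\\]]*)\\]\\((?<link>[^)]*)\\)");

    // Returns one {text, link} pair per link, in order of appearance.
    static List<String[]> links(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher m = LINK.matcher(input);
        while (m.find()) {
            result.add(new String[] { m.group("text"), m.group("link") });
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] link : links("Well, [this](that) has [two](too many) keys.")) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

groupCount() still reports 2 for this pattern; the number of links is the number of successful find() calls, not the number of groups.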
The regular expression I've used to match what you need (without groups) is \[\w+\]\(.+\)
It is deliberately simple. Basically it does the following:
Match an opening square bracket: \[
Followed by one or more word characters: \w+
Then the closing square bracket: \]
This matches patterns like [blabla]
Then the same with parentheses:
Match an opening parenthesis: \(
Followed by one or more of any character: .+
Then the closing parenthesis: \)
So it matches (ble...ble...)
Note that .+ is greedy: with several links on one line it runs up to the last closing parenthesis, which [^)]+ would avoid.
If you want to store the matches in groups, you can add parentheses like this:
(\[\w+\])(\(.+\)) so that the text and the link are stored in separate groups.
Hope this helps.
I've tried it on regexplanet.com and it works.
Update: workaround .*(\[\w+\])(\(.+\))*.*

How to split a string which contains multiple key value pairs

I have a string:
Single line : Some text
Multi1: multi (Va1) Multi2 : multi (Va2) Multi3 : multi (Val3)
Dots....20/12/2013 (EOY)
and I am trying to retrieve all the key value pairs. My first attempt
(Single line|Multi[0-9]{1}|Dots)( *:? [.] *| *:? )(.)
seems to work but does not handle multiple key value pairs on one line. Is there any way to achieve this?
Try this:
String text = "Single line : Some text\r\n" +
"Multi1: multi (Va1) Multi2 : multi (Va2) Multi3 : multi (Val3)\r\n" +
"Dots....20/12/2013 (EOY)";
Pattern pattern = Pattern.compile("(\\p{Alnum}[\\p{Alnum}\\s/]+?)\\s?(:|\\.+)\\s?(\\p{Alnum}[\\p{Alnum}\\s/]+?)(?=($|\\()|(\\s\\())", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1) + "-->" + matcher.group(3));
}
Output:
Single line-->Some text
Multi1-->multi
Multi2-->multi
Multi3-->multi
Dots-->20/12/2013
Explanation:
I am limiting the keys and values to "starts with alphanumeric",
"contains any number of alphanumerics, spaces or slashes".
I am limiting the separator to "optional space, :, optional space" or
"optional space, any number of consecutive dots, optional space".
I am using groups 1 and 3 to define the key and value in the
Pattern.
Group 2 is used to provide alternate separators as above.
Finally, the Pattern is delimited at the end, either with a new
line, or with an open round bracket, or, with a space followed by an
open round bracket.
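Note that Java restricts quantifiers only inside lookbehind groups, which must have a bounded length. Lookahead groups accept any quantifier, so the repeated \\( alternatives above could also be collapsed into (?=$|\\s?\\().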
You can use this pattern:
public static void main(String[] args) {
String s = "Single line : Some text\n"
+ "Multi1: multi (Va1) Multi2 : multi (Va2) "
+ "Multi3 : multi (Val3)\n"
+ "Dots....20/12/2013 (EOY)";
String wd = "[^\\s.:]+(?:[^\\S\\n]+[^\\s.:]+)*";
Pattern p = Pattern.compile("(?<key>" + wd + ")"
+ "\\s*(?::|\\.+)\\s*"
+ "(?<value>" + wd + "(?:\\s*\\([^)]+\\))?)"
+ "(?!\\s*:)(?=\\s|$)");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.println(m.group("key")+"->"+m.group("value"));
}
}
I don't recall the exact syntax, but I think it's something like this:
while (matcher.find()) {
String match = matcher.group();
}
The goal here is that you need to iterate over the current line and tell it "while you are still finding stuff, return to me the string on this line that matched." Since you have multiple matches on the same line, it should keep pulling out findings for you. Here is the JavaDoc for Matcher as a reference.
This is sadly another reason why Java is really not well-suited for this sort of thing, and before anyone downmods me understand I say that as a criticism of the Java APIs here, not the language.
