I was trying to answer a question recently and while attempting to solve it, I ran into a question of my own.
Given the following code
private void regexample(){
String x = "a3ab4b5";
Pattern p = Pattern.compile("(\\D+(\\d+)\\D+){2}");
Matcher m = p.matcher(x);
while(m.find()){
for(int i=0;i<=m.groupCount();i++){
System.out.println("Group " + i + " = " + m.group(i));
}
}
}
And the output
Group 0 = a3ab4b
Group 1 = b4b
Group 2 = 4
Is there any straightforward way I'm missing to get the value 3? The pattern should look for two occurrences of (\\D+(\\d+)\\D+) back-to-back, and a3a is part of the match. I realize I can change the expression to (\\D+(\\d+)\\D+) and then look for all matches, but that isn't technically the same thing. Is the only way to do a double search, i.e. use the given pattern to match the string and then search again for each occurrence of the outer group?
I guessed that the first values were overwritten with the second, but as I'm not that great with regex, I was hoping there was something I was missing.
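A capturing group that is repeated by a quantifier keeps only the text of its last iteration (in Java and most standard regex engines), so the earlier occurrence is overwritten. You could write the repetition out explicitly instead: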
Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");
Now there are four groups instead of two, so you will get the values you expected: group 2 holds 3 and group 4 holds 4.
This question deals with a similar problem.
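As a minimal sketch of the unrolled pattern from this answer, the following reads both numbers from the sample string. The class and method names (UnrolledGroupsDemo, bothNumbers) are made up for illustration and are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnrolledGroupsDemo {
    // The repetition {2} is written out, so each occurrence gets its own groups:
    // groups 1 and 2 belong to the first unit, groups 3 and 4 to the second.
    private static final Pattern UNROLLED =
            Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");

    // Returns the digits of the first and second unit, or an empty array if there is no match.
    static String[] bothNumbers(String input) {
        Matcher m = UNROLLED.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(2), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(",", bothNumbers("a3ab4b5"))); // prints 3,4
    }
}
```

This only works when the number of repetitions is fixed and known in advance; for a variable count, a second search over the matched text is still needed.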
I'm trying to create a lexical analyzer for Delphi using Java. Here's the sample code:
String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"};
String[] relation={"=","<>","<",">","<=",">="};
String[] logical={"and","not","or","xor"};
Matcher matcher = null;
for(int i=0;i<keywords.length;i++){
matcher=Pattern.compile(keywords[i]).matcher(line);
if(matcher.find()){
System.out.println("Keyword"+"\t\t"+matcher.group());
}
}
for(int i1=0;i1<logical.length;i1++){
matcher=Pattern.compile(logical[i1]).matcher(line);
if(matcher.find()){
System.out.println("logic_op"+"\t\t"+matcher.group());
}
}
for(int i2=0;i2<relation.length;i2++){
matcher=Pattern.compile(relation[i2]).matcher(line);
if(matcher.find()){
System.out.println("relational_op"+"\t\t"+matcher.group());
}
}
When I run the program it works, but it re-reads parts of certain words and reports them as separate tokens. For example, record is a keyword, but the program also finds the logical operator or inside it (rec"or"d). How can I stop it from matching inside words? Thanks!
Add \b to your regular expressions for breaks between words. So:
Pattern.compile("\\b" + keywords[i] + "\\b")
will ensure that the characters on either side of your word aren't word characters (letters, digits, or underscore).
This way "record" in the input will only match the keyword "record", and "or" will no longer match inside it.
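A small sketch of the effect of \b follows. The helper name containsWholeWord and the sample lines are assumptions made for the example; Pattern.quote is used only so that the word is treated literally.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // True if 'word' occurs in 'line' as a whole word, i.e. with a word boundary on both sides.
    static boolean containsWholeWord(String line, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWholeWord("type r = record", "or"));     // false
        System.out.println(containsWholeWord("if a or b then", "or"));      // true
        System.out.println(containsWholeWord("type r = record", "record")); // true
    }
}
```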
As mentioned in the answer by EvanM, you need to add a \b word boundary matcher before and after the keyword to prevent substring matching within a word.
For better performance, you should also use the | logical regex operator to match one of many values, instead of creating multiple matchers, so you only have to scan the line once, and only have to compile one regex.
You can even combine the 3 different kinds of token you are looking for in a single regex, and use capture groups to differentiate them, so you only have to scan the line once in total.
Like this:
String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" +
"|(=|<[>=]?|>=?)" +
"|\\b(and|not|or|xor)\\b";
for (Matcher m = Pattern.compile(regex).matcher(line); m.find(); ) {
if (m.start(1) != -1) {
System.out.println("Keyword\t\t" + m.group(1));
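} else if (m.start(2) != -1) {
// Group 2 is the relational-operator alternative (=, <, <>, <=, >, >=)
System.out.println("relational_op\t\t" + m.group(2));
} else {
// Group 3 is the logical-operator alternative (and, not, or, xor)
System.out.println("logic_op\t\t" + m.group(3));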
}
}
You can even optimize it further by combining keywords with common prefixes, e.g. as|asm could become asm?, i.e. as optionally followed by m. That makes the keyword list less readable, but it may perform better.
In the code above, I did that for the relational operators, to show how, and also to fix a matching error in the original code, where >= in the line would show up 3 times as =, >, >= in that order, a problem similar to the sub-keyword problem asked about in the question.
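To illustrate the >= point, here is a short sketch that scans a line with only the relational-operator part of the regex above. The class name, method name, and sample inputs are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelationalTokenDemo {
    // Same alternation as in the combined regex: =, <, <>, <=, >, >=
    private static final Pattern RELATIONAL = Pattern.compile("=|<[>=]?|>=?");

    // Collects every relational operator found in one left-to-right scan.
    static List<String> relationalOps(String line) {
        List<String> ops = new ArrayList<>();
        Matcher m = RELATIONAL.matcher(line);
        while (m.find()) {
            ops.add(m.group());
        }
        return ops;
    }

    public static void main(String[] args) {
        System.out.println(relationalOps("if a >= b then"));  // [>=]
        System.out.println(relationalOps("x <> y"));          // [<>]
        System.out.println(relationalOps("a <= b = c"));      // [<=, =]
    }
}
```

Because the scan consumes >= as one token, the separate = and > matches of the original per-operator loops do not appear.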
I've implemented quite a complicated pattern to match all occurrences of a ship set number. It works perfectly with a global, case-insensitive comparison.
I use the following code to do the same thing in Java, but it doesn't match. Should a Java regex be written differently?
int i = 0;
while (i < elementsArray.size()) {
System.out.println("List element:"+elementsArray.get(i));
String theRegex = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
if (elementsArray.get(i).matches(theRegex)) {
System.out.println("RESULT:");
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
}
System.out.println("text==========" + shipsets);
}
i++;
}
Here is a simplification of your code which should work, assuming that your regex works correctly in Java. From my preliminary investigation, it does seem to match many of the use cases in your link. You don't need String.matches(), because you are already using a Matcher, which tells you whether or not there is a match.
List<String> elementsArray = new ArrayList<String>();
elementsArray.add("Shipset Number 323");
elementsArray.add("meh");
elementsArray.add("SS NO. : 34");
elementsArray.add("Mary had a little lamb");
elementsArray.add("Ship Set #2, #33 to #4.");
for (int i=0; i < elementsArray.size(); ++i) {
System.out.println("List element:"+elementsArray.get(i));
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
System.out.println("Found a match at element " + i + ": " + shipsets);
}
}
You can see in the output below, that the three ship test strings all matched, and the controls "meh" and "Mary had a little lamb" did not match.
Output:
List element:Shipset Number 323
Found a match at element 0: Shipset Number 323
List element:meh
List element:SS NO. : 34
Found a match at element 2: SS NO. : 34
List element:Mary had a little lamb
List element:Ship Set #2, #33 to #4.
Found a match at element 4: Ship Set #2, #33 to #4.
In my opinion your problems are caused by two things.
The first is the use of matches() in if (elementsArray.get(i).matches(theRegex)). matches() returns true only if the whole string matches the regex, so it succeeds in many cases from your example but fails with inputs such as SS#1,SS#5,SS#6, SS1, SS2, SS3, SS4, etc. You can simulate this behavior by adding ^ at the beginning and $ at the end of the regex. It is therefore better to use matcher.find() instead of String.matches(), as in Tim Biegeleisen's answer.
The second is the use of if (matcher.find()) instead of while (matcher.find()). Some strings contain more than one result, so you need to call matcher.find() repeatedly to get all of them. An if runs only once, so you get only the first matched fragment of the string. To retrieve them all, use a loop: matcher.find() returns false when there is no further match in the string, which ends the loop.
This is Tim Biegeleisen's solution with one small change (while instead of if).
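The difference between the two calls can be sketched with a deliberately simplified pattern. Note that SS\s*#?\s*\d+ below is a stand-in invented for this example, not the full ship set regex from the question, and the class and method names are also hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchesVersusFindDemo {
    // Simplified, hypothetical pattern: "SS", an optional '#', then a number.
    private static final Pattern SHIPSET =
            Pattern.compile("SS\\s*#?\\s*\\d+", Pattern.CASE_INSENSITIVE);

    // matches() succeeds only if the ENTIRE input is one match.
    static boolean wholeStringMatches(String input) {
        return SHIPSET.matcher(input).matches();
    }

    // Looping on find() returns every match in the input, one per iteration.
    static List<String> findAll(String input) {
        List<String> results = new ArrayList<>();
        Matcher m = SHIPSET.matcher(input);
        while (m.find()) {
            results.add(m.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "SS#1,SS#5,SS#6";
        System.out.println(wholeStringMatches(input)); // false
        System.out.println(findAll(input));            // [SS#1, SS#5, SS#6]
    }
}
```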
My data is not well formed, as shown below. I need to extract the text before and after the dot using a regular expression. I have tried the code below, but I am not getting the exact data.
String rightHeading=null;
String leftHeading=null;
String formulaData="ifnull(\"Content Status\".\"Week Of Quarter\",0)";
Matcher matcher = Pattern.compile("(\"?([^()]*?)\"?)\\.(\"?([##$%><{}\\w ]*)\"?)").matcher(formulaData);
while (matcher.find())
{
String Column_Data=matcher.group(0);
String[] pieces = Column_Data.split("\\.");
rightHeading=pieces[0].replace("\"", "");
leftHeading=pieces[1].replace("\"", "");
System.out.println(rightHeading+ ": "+leftHeading);
}//while
Output which I got is:
ifnullContent Status.Week Of Quarter,0)
Expected output:
Content Status.Week Of Quarter
Below is my solution for your problem, along with the output that it produces.
String formulaData="(100*(FILTER(\"Fact - Bookings\".\"$ Total Gross Bookings\" USING (\"Booking Date\".\"Year\" = VALUEOF(\"CUR_YEAR\"))) - FILTER(Fact - Bookings.$ Total Gross BookingsData USING \"Booking Date\".\"Year\" = VALUEOF(\"PREV_YEAR\") AND \"Booking Date\".Sortable Number <= VALUEOF(\"PRV_YEAR_TD\") ) ) / FILTER(Fact - Bookings.$TotalGrossBookingsUsage \" USING \"Booking Date\".\"Year\" = VALUEOF(\"PREV_YEAR\") AND \"Booking Date\".\"Sortable Number\" <= VALUEOF(\"PRV_YEAR_TD\") ) )";
String p1 = "(\"(\\w*\\s*-*)*?\"\\.\".*?\")|((?:\\()((\\w*\\s*-*)*?\\.\\$\\w+))|(\"(\\w*\\s*-*)*?\"\\.(\\w+\\s+)+)";
Pattern p = Pattern.compile(p1);
Matcher m = p.matcher(formulaData);
while(m.find())
{
System.out.println(m.group(0).replaceAll("\"|\\(|\\)", ""));
}
Outputs:
Fact - Bookings.$ Total Gross Bookings
Booking Date.Year
Fact - Bookings.$ Total Gross BookingsData
Booking Date.Year
Booking Date.Sortable Number
Fact - Bookings.$TotalGrossBookingsUsage
Booking Date.Year
Booking Date.Sortable Number
As you can see, I didn't actually use a horrifically complex regex to solve your problem, because your input is far too varied for that to be effective.
The fact that your table.field pairs sometimes had $ or " symbols inside them made the data very inconsistent. Regular expressions find it hard to deal with this level of complexity, so I think my solution (in this example) is workable.
However, in future if you have any control over your data input, please try to sanitize it and make it as consistent as possible.
EDIT
Since that didn't work out for you, I've gone and changed my code snippet to use a regular expression.
Matcher matcher = Pattern.compile("([\\w[\\$##\\-^&]\\w\\[\\]' $]+)\\.([\\w\\[\\]' $]+)").matcher(formulaData);
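// lookingAt() always starts at the beginning of the region, so it is tested once;
// looping on it would never terminate after a successful match.
if (matcher.lookingAt()) {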
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end());
System.out.println(" Found: " + matcher.group());
}
lookingAt() is more suitable here as per the requirements and as mentioned in doc --
lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched.
If the match succeeds then more information can be obtained via the start, end, and group methods.
Hope this helps.
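For comparison, a minimal sketch of how matches(), lookingAt(), and find() differ on the same input. The pattern \d+ and the sample strings are chosen only for illustration, and a fresh Matcher is created for each call so that no state carries over between them.

```java
import java.util.regex.Pattern;

public class LookingAtDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Reports the result of the three matching methods for one input.
    static String describe(String input) {
        boolean matches = DIGITS.matcher(input).matches();     // whole input must match
        boolean lookingAt = DIGITS.matcher(input).lookingAt(); // must match at the start only
        boolean find = DIGITS.matcher(input).find();           // may match anywhere
        return "matches=" + matches + " lookingAt=" + lookingAt + " find=" + find;
    }

    public static void main(String[] args) {
        System.out.println(describe("123abc")); // matches=false lookingAt=true find=true
        System.out.println(describe("abc123")); // matches=false lookingAt=false find=true
        System.out.println(describe("123"));    // matches=true lookingAt=true find=true
    }
}
```

If the text of interest can appear in the middle of the input, as in a formula that begins with a function name, find() is the method that will locate it.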
Consider an input string like
Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5
and the regular expression
\b(TWO|FOUR)=([^ ]*)\b
Using this regular expression, the following code can extract the 2 specific key-value pairs out of the 5 total ones (i.e., only some predefined key-value pairs should be extracted).
public static void main(String[] args) throws Exception {
String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
String regex = "\\b(TWO|FOUR)=([^ ]*)\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println("\t" + matcher.group(1) + " = " + matcher.group(2));
}
}
More specifically, the main() method above prints
TWO = 2
FOUR = 4
but every time find() is invoked, the whole regular expression is evaluated for the part of the string remaining after the latest match, left to right.
Also, if the keys are not mutually distinct (or, if a regular expression with overlapping matches was used in the place of each key), there will be multiple matches. For instance, if the regex becomes
\b(O.*?|T.*?)=([^ ]*)\b
the above method yields
ONE = 1
TWO = 2
THREE = 3
If the regex was not fully re-evaluated but each alternative part was somehow examined once (or, if an appropriately modified regex was used), the output would have been
ONE = 1
TWO = 2
So, two questions:
Is there a more efficient way of extracting a selected set of unique keys and their values, compared to the original regular expression?
Is there a regular expression that can match every alternative part of the OR (|) sub-expression exactly once and not evaluate it again?
Java Returns a Match Position: You can Use Dynamically-Generated Regex on Remaining Substrings
With the understanding that it can be generalized to a more complex and useful scenario, let's take a variation on your first example: \b(TWO|FOUR|SEVEN)=([^ ]*)\b
You can use it like this:
Pattern regex = Pattern.compile("\\b(TWO|FOUR|SEVEN)=([^ ]*)\\b");
Matcher regexMatcher = regex.matcher(yourString);
if (regexMatcher.find()) {
String theMatch = regexMatcher.group();
String FoundToken = regexMatcher.group(1);
int EndPosition = regexMatcher.end(); // end() returns an int index, not a String
}
You could then:
Test the value contained by FoundToken
Depending on that value, dynamically generate a regex testing for the remaining possible tokens. For instance, if you found FOUR, your new regex would be \\b(TWO|SEVEN)=([^ ]*)\\b
Using EndPosition, apply that regex to the remainder of the string, i.e. the part after the match.
Discussion
This approach would serve your goal of not re-evaluating parts of the OR that have already matched.
It also serves your goal of avoiding duplicates.
Would that be faster? Not in this simple case. But you said you are dealing with a real problem, and it will be a valid approach in some cases.
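The approach described in this answer could be sketched roughly as follows. The class name, the method extractOnce, and the choice to rebuild the alternation from a set of remaining keys are assumptions made for the example; Matcher.find(int) is used to continue from the end of the previous match.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShrinkingAlternationDemo {
    // Extracts at most one value per key; a key is dropped from the regex once it has matched.
    static Map<String, String> extractOnce(String input, Collection<String> keys) {
        Set<String> remaining = new LinkedHashSet<>(keys);
        Map<String, String> found = new LinkedHashMap<>();
        int position = 0;
        while (!remaining.isEmpty()) {
            // Rebuild the alternation from the keys that have not matched yet.
            StringJoiner alternation = new StringJoiner("|");
            for (String key : remaining) {
                alternation.add(Pattern.quote(key));
            }
            Pattern p = Pattern.compile("\\b(" + alternation + ")=([^ ]*)\\b");
            Matcher m = p.matcher(input);
            if (!m.find(position)) {
                break; // none of the remaining keys occurs in the rest of the input
            }
            found.put(m.group(1), m.group(2));
            remaining.remove(m.group(1));
            position = m.end(); // continue after the latest match
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Number ONE=1 appears before TWO=2 and THREE=3 comes before FOUR=4 and FIVE=5";
        System.out.println(extractOnce(input, Arrays.asList("TWO", "FOUR", "SEVEN")));
        // prints {TWO=2, FOUR=4}
    }
}
```

Recompiling a pattern on every iteration has its own cost, so this is unlikely to be faster for a handful of keys; it mainly shows how the duplicate matches are avoided.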
Let's say that you want to match a string with the following regex:
".when is (\w+)." - I am trying to get the event after 'when is'
I can get the event with matcher.group(index), but this doesn't work if the event is something like Veteran's Day, since it is two words. I only get the first word after 'when is'.
What regex should I use to get all of the words after 'when is'?
Also, let's say I want to capture someone's birthday, like
'when is * birthday'
How do I capture all of the text between is and birthday with a regex?
You could try this:
^when is (.*)$
This will find a string that starts with when is and capture everything else to the end of the line.
The regex will return one group. You can access it like so:
String line = "when is Veteran's Day.";
Pattern pattern = Pattern.compile("^when is (.*)$");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
System.out.println("group 0: " + matcher.group(0)); // the whole match
System.out.println("group 1: " + matcher.group(1)); // the captured text after "when is "
}
And the output should be:
group 0: when is Veteran's Day.
group 1: Veteran's Day.
If you want to allow whitespace to be matched, you should explicitly allow whitespace.
([\w\s]+)
However, roydukkey's solution will work if you want to capture everything after when is.
Don't use regular expressions when you don't need to! Regular expressions are elegant in that a single string can express a whole matching routine, but for simple cases like this one they add overhead that plain string operations avoid.
If you are trying to get the word after "when is" ending by a space, you could do something like this:
String start = "when is ";
String end = " ";
int startLocation = fullString.indexOf(start) + start.length();
String afterStart = fullString.substring(startLocation, fullString.length());
String word = afterStart.substring(0, afterStart.indexOf(end));
If you know the last word is Day, you can set end = "Day" and add end.length() to the end index of the second substring so that Day itself is included.
You can express this as a character class and include spaces in it: when is ([\w ]+).
\w only includes word characters, which doesn't include spaces. Use [\w ]+ instead.
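As a rough sketch covering both parts of the question: the character class below also includes an apostrophe, since [\w ]+ alone would stop at the apostrophe in Veteran's Day, and a lazy (.+?) is used for the text before birthday. The class name, method names, and sample sentences are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhenIsDemo {
    // Word characters, apostrophes, and spaces after "when is ".
    private static final Pattern EVENT = Pattern.compile("when is ([\\w' ]+)");
    // Everything between "when is " and " birthday", as little as possible.
    private static final Pattern BIRTHDAY = Pattern.compile("when is (.+?) birthday");

    // Returns the event text, or null if the line does not contain "when is ...".
    static String event(String line) {
        Matcher m = EVENT.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the text between "when is" and "birthday", or null if there is no match.
    static String birthdayOf(String line) {
        Matcher m = BIRTHDAY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(event("when is Veteran's Day."));              // Veteran's Day
        System.out.println(birthdayOf("when is John Smith's birthday"));  // John Smith's
    }
}
```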