Parse Drools rule file with Java regex - java

I'm interested in parsing a Drools rule file using regular expressions.
Having a string with the content of the whole .drl file, I'd like to have 4 substrings:
A substring with the content of <name>
A substring with the content of <attribute>
A substring with the content of <conditional element>
A substring with the content of <action>
A Drools rule has the following structure, according to the official documentation:
rule "<name>"
<attribute>*
when
<conditional element>*
then
<action>*
end
I've tried using this pattern, but it hasn't worked well:
^rule"(.|\n|\r|\t)+"(.|\n|\r|\t)+\bwhen\b(.|\n|\r|\t)+\bthen\b(.|\n|\r|\t)+\bend\b?$
Does anyone have an idea of how I could proceed?

I know your question is about regular expressions, but I would strongly advise against using them here. There are far too many cases where your regex will fail: for instance, rule names that are a single word don't need quotes, the rule keyword does not need to be the first thing on the line, and so on. For example:
/*this is a comment on the start of the line*/ rule X...
Instead of regexp, just use the DrlParser directly and it will give you all the information you need:
String drl = "package foo \n"
+ "declare Bean1 \n"
+ "field1: java.math.BigDecimal \n"
+ "end \n"
+ "rule bigdecimal\n"
+ "when \n"
+ "Bean1( field1 == 0B ) \n"
+ "then \n"
+ "end";
DrlParser parser = new DrlParser(LanguageLevelOption.DRL6);
PackageDescr pkgDescr = parser.parse( null, drl );
pkgDescr.getRules() will give you all the RuleDescr objects in the file; each RuleDescr has a getName() that returns the rule name, and so on. It is all type safe, with no edge cases to handle yourself.

You almost got it. This works:
^rule\s+\"(.|\n|\r|\t)+\"(.|\n|\r|\t)+\bwhen\b(.|\n|\r|\t)+\bthen\b(.|\n|\r|\t)+\bend\b?$
Another solution:
^\s*rule\s+\"([^\"]+)\"[\s\S]+\s+when\s+([\s\S]+)\s+then\s+([\s\S]+)\send\s*$
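Note: Your pattern was missing the whitespace between rule and the opening quote, and inside a Java string literal " has to be escaped as \".
Tips:
You can use \s for whitespace characters.
[^\"] matches any character except ".
[\s\S] matches any character, including line breaks.
\b only asserts a boundary between a word character [a-zA-Z0-9_] and a non-word character, whereas \s+ requires actual whitespace around the keyword. It is just an extra precaution in case an attribute starts with a special character.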
Use a program like Rad Software Regular Expression Designer. That will dramatically simplify editing and testing your regex code.
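As a rough sketch, the second pattern could be adapted to capture all four parts the question asks for. The class name DrlRuleRegex, the named groups and the sample rule are illustrative assumptions, not part of the answers above. The pattern expects a string that holds a single rule with a quoted name, and it does not handle the edge cases (comments, unquoted names) raised in the other answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DrlRuleRegex {
    // Illustrative pattern (an assumption): a quoted rule name, followed by
    // attributes, conditions and actions separated by the when/then/end keywords.
    private static final Pattern RULE = Pattern.compile(
            "^\\s*rule\\s+\"(?<name>[^\"]+)\""
            + "(?<attributes>[\\s\\S]*?)\\bwhen\\b"
            + "(?<conditions>[\\s\\S]*?)\\bthen\\b"
            + "(?<actions>[\\s\\S]*)\\bend\\s*$");

    /** Returns {name, attributes, conditions, actions}, or null if nothing matches. */
    static String[] parse(String drl) {
        Matcher m = RULE.matcher(drl);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group("name"),
            m.group("attributes").trim(),
            m.group("conditions").trim(),
            m.group("actions").trim()
        };
    }

    public static void main(String[] args) {
        String drl = "rule \"Discount\"\n"
                + "    salience 10\n"
                + "when\n"
                + "    Order( total > 100 )\n"
                + "then\n"
                + "    System.out.println(\"discount\");\n"
                + "end\n";
        for (String part : parse(drl)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The attribute and condition groups are reluctant so that they stop at the first when and then keyword, while the action group is greedy so that it runs up to the last end in the string.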


regular expression for key=(value) syntax

I am currently writing a Java program that uses regular expressions, but I am struggling as I am pretty new to regex.
KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!@#\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]/ ]*";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)";
The target syntax is key(value)+key(value)+key(value). Key is alphanumeric and value is allowed to be any combination.
This has been okay so far. However, I have a problem with '(', ')' in value. If I place '(' or ')' in the value, value includes all the rest.
e.g. number(abc(kk)123)+status(open) returns key:number, value:abc(kk)123)+status(open
It is supposed to be two pairs of key-value.
Can you suggest how to improve the expression above?
Not possible with regular expressions at all, sorry. If you want to count opening and closing parentheses, regular expressions are, in general, not good enough. The language you are trying to parse is not a regular language.
Of course, there may be ways around that limitation. We cannot know that if you give us as little context as you did.
Get the matched groups at index 1 and 2:
([a-zA-Z0-9]+)\((.*?)\)(?=\+|$)
The above regex pattern looks for )+ (or a ) at the end of the input) as the delimiter between key(value) pairs.
Note: The above regex pattern will not work if the value contains )+, for example number(abc(kk)+123+4+4)+status(open)
Sample code:
String str = "number(abc(kk)123)+status(open)";
Pattern p = Pattern.compile("([a-zA-Z0-9]+)\\((.*?)\\)(?=\\+|$)");
Matcher m = p.matcher(str);
while (m.find()) {
System.out.println(m.group(1) + ":" + m.group(2));
}
output:
number:abc(kk)123
status:open
Someone posted an answer with a working solution regex: ([a-zA-z0-9]+)\((.*?)\)(?=\+|$) - This works great. When I tested on online regex tester site and came back, the post had gone. Is it right solution? I am wondering why the answer has been deleted.
See this golfed regex:
([^\W_]+)\((.*?)\)(?![^+])
You can use the shorthand character class [^\W_] instead of [a-zA-Z0-9].
You can use a negative lookahead assertion (?![^+]), which succeeds only when the next character is + or the input has ended, instead of the alternation in (?=\+|$).
However, this is not a practical solution, as )+ within inner elements will break it: number(abc(kk)+5+123+4+4)+status(open)
This is a case where Java, whose regex implementation doesn't support recursion, is at a disadvantage. As I mentioned in this thread, the practical approach would be to use a workaround (copy-paste regex), or build your own finite state machine to parse it.
Also, you have a typographical error in your original regex. [a-zA-z0-9]+ has a range "A-z", which also matches the characters [ \ ] ^ _ and `. You meant to type "A-Z".
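The hand-written parser suggested here could look roughly like the sketch below. It is only an illustration: the class name KeyValueChunks and the method parse are made up for this example, and it assumes that parentheses inside a value are balanced. It tracks the nesting depth instead of relying on a regex, so a )+ inside a value no longer ends the chunk early.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueChunks {
    /** Parses key(value)+key(value)+... into an ordered map (hypothetical helper). */
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i < input.length()) {
            int open = input.indexOf('(', i);
            if (open < 0) {
                throw new IllegalArgumentException("missing '(' after position " + i);
            }
            String key = input.substring(i, open);
            // Walk forward until the parenthesis opened at 'open' is closed again.
            int depth = 1;
            int j = open + 1;
            while (j < input.length() && depth > 0) {
                char c = input.charAt(j);
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                j++;
            }
            if (depth != 0) {
                throw new IllegalArgumentException("unbalanced parentheses");
            }
            // j now points just past the matching ')'.
            result.put(key, input.substring(open + 1, j - 1));
            if (j < input.length()) {
                if (input.charAt(j) != '+') {
                    throw new IllegalArgumentException("expected '+' at position " + j);
                }
                j++; // skip the '+' separator
            }
            i = j;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("number(abc(kk)+5+123+4+4)+status(open)"));
    }
}
```

The sketch does not validate that keys are alphanumeric; that check could be added with the key expression from the question.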
I'll assume that you're able to add a + at the end of your input,
i.e. number(abc(kk)123)+status(open)+
If that is possible, you can make it work with:
KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!@#\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]\\(\\)/ ]*?";
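CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)\\+";
The changes are on line 2: adding ( and ) (escaped) to the character class and replacing * with *?.
The ? turns off greedy matching so that the shortest possible match is tried first (reluctant quantifier).
On line 3, an escaped + (\\+) is added at the end of the pattern, so that every key(value) field must be followed by a literal +. An unescaped + there would only mean "one or more )" and would end the match at the first closing parenthesis.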

Regex Lookahead and Lookbehinds: followed by this or that

I'm trying to write a regular expression that checks ahead to make sure there is either a white space character OR an opening parentheses after the words I'm searching for.
Also, I want it to look back and make sure it is preceded by either a non-Word (\W) or nothing at all (i.e. it is the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.
As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])";
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
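A small sketch of the negative-lookbehind variant in use. The class name WordFinder, the method findStarts and the sample text are assumptions made for illustration. Because lookarounds consume nothing, each match covers exactly the word itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {
    /** Returns the start index of every standalone occurrence of word in text. */
    static List<Integer> findStarts(String word, String text) {
        Pattern p = Pattern.compile(
                "(?<!\\w)"              // not preceded by a word character
                + Pattern.quote(word)   // the word, taken literally
                + "(?=[\\s(])");        // followed by whitespace or '('
        Matcher m = p.matcher(text);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            // m.group() is exactly the word; the surrounding characters are not included.
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findStarts("max", "max(a) + mymax(b) + max c + maxd + max"));
    }
}
```

Note that the lookahead requires an actual character after the word, so an occurrence at the very end of the input is not matched.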
If you don't want a group to be captured by the matching, you can use the special construct (?:X)
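So, in your case:
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will only have two groups then: group(0) for the whole match (which still includes the surrounding characters) and group(1) for the word you are looking for. Note that the spaces around | have been removed, because inside a regex they would be matched literally.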

java performance issue - regular expression VS internal String method

I'm having the following issue:
I have some string somewhere in my application that I want to check - the check is whether this string contains a character that is different than " "(white space), /n and /r
For example:
" g" - Contains
" /n " - Not Contains
" " - Not Contains
I want to do it in a reg expression, but I don't want to use the common pattern .*[a-zA-Z0-9]+.* . Instead, I want something like .*[!" ""/n"/r"]. (every character that is different than " " "/r" and "n").
My problems are that
I don't know if this pattern is valid (the above isn't working)
I'm not sure whether it would be much faster than using the regular String methods.
Firstly, you mean \n and \r, and in Java this means escaping the backslash as well with \\n and \\r.
Secondly, if you merely want to catch any non-whitespace character, search for the pattern \\S (or, equivalently, [^\\s]) with Matcher.find(). \S matches a non-whitespace character, \s matches a whitespace character, and [^<charset>] means "match anything that isn't one of these."
Thirdly, if this is a repeated check, be sure to only compile the regex once then use it multiple times.
Fourthly, follow the usual strategy for profiling. First, is this on a critical path in your application? If so, benchmark it yourself.
Here's something that does exactly what you want, but (as I said above) it will be faster to go over the characters yourself:
Pattern NOT_WHITESPACE_DETECTOR = Pattern.compile("[^ \\n\\r]");
Matcher m = NOT_WHITESPACE_DETECTOR.matcher(" \n \r bla ");
if (m.find()) {
//string contains a non-white-space
}
Also note that the definition of whitespace in Java is much wider than the one you specified, and even then there are whitespace characters in Unicode that Java doesn't detect (there are libraries that do, however).
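For comparison, the plain character loop mentioned above might look like this. It is a minimal sketch: the class name BlankCheck and the method hasVisibleChar are made up for this example, and whether it is actually faster in a given application should be measured rather than assumed.

```java
class BlankCheck {
    /** True if s contains any character other than ' ', '\n' and '\r'. */
    static boolean hasVisibleChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\n' && c != '\r') {
                return true; // stop at the first character that is not in the ignored set
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleChar(" g"));    // contains
        System.out.println(hasVisibleChar(" \n "));  // does not contain
    }
}
```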

Using regex to remove quote

I saw a good sample, but I cannot adapt it for my problem.
I would like to remove only the " characters that enclose a whole field from a CSV line like:
" kkl ";"aa bb D";;12 "AA";;"SSS"-;" gg 12";" vv";"sdqs ";
expected result :
kkl ;aa bb D;;12 "AA";;"SSS"-; gg 12; vv;sdqs ;
I use the Pattern and Matcher classes.
This solution assumes that there is no escaped quote \" in the quoted string
.replaceAll("(?<=^|;)\"([^\"]*?)\"(?=;|$)", "$1")
I assume that you also want to strip off the " in these cases: "sdfkjhksdf", ;;;"dffff"
Another solution uses possessive quantifier, whose effect relies on the assumption that " doesn't appear inside the quoted portion.
.replaceAll("(?<=^|;)(?:\"(.*?)\"){1}+(?=;|$)", "$1")
Small modification to @nhahtdh's regex in order to keep it from greedily matching outside of a CSV boundary:
.replaceAll("(?<=^|;)\"([^;]*)\"(?=;|$)", "$1");
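To try the first of these expressions on the line from the question, a small wrapper like the following can be used. The class name CsvQuoteStripper and the method strip are illustrative assumptions; the regex itself is the one from the first answer and still assumes that no escaped quote appears inside a quoted field.

```java
class CsvQuoteStripper {
    /** Removes quotes only where they enclose an entire ;-separated field. */
    static String strip(String line) {
        // (?<=^|;)  the opening quote must start the line or follow a ';'
        // (?=;|$)   the closing quote must be followed by ';' or the end of the line
        return line.replaceAll("(?<=^|;)\"([^\"]*?)\"(?=;|$)", "$1");
    }

    public static void main(String[] args) {
        String line = "\" kkl \";\"aa bb D\";;12 \"AA\";;\"SSS\"-;\" gg 12\";\" vv\";\"sdqs \";";
        System.out.println(strip(line));
    }
}
```

Fields such as 12 "AA" and "SSS"- are left untouched because their quotes do not span the whole field.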

java - split string using regular expression

I need to split a string where there's a comma, but it depends where the comma is placed.
As an example, consider the following:
C=75,user_is_active(A,B),user_is_using_app(A,B),D=78
I'd like the String.split() function to separate them like this:
C=75
user_is_active(A,B)
user_is_using_app(A,B)
D=78
I can only think of one thing but I'm not sure how it'd be expressed in regex.
The characters/words within the brackets are always capital. In other words, there won't be a situation where I will have user_is_active(a,b).
Is there a way to do this?
If you don't have more than one level of parentheses, you could do a split on a comma that isn't followed by a closing ) before an opening (:
String[] splitArray = subjectString.split(
"(?x), # Verbose regex: Match a comma\n" +
"(?! # unless it's followed by...\n" +
" [^(]* # any number of characters except (\n" +
" \\) # and a )\n" +
") # end of lookahead assertion");
Your proposed rule would translate as
String[] splitArray = subjectString.split(
"(?x), # Verbose regex: Match a comma\n" +
"(?<!\\p{Lu}) # unless it's preceded by an uppercase letter\n" +
"(?!\\p{Lu}) # or followed by an uppercase letter");
but then you would miss a split in a text like
Org=NASA,Craft=Shuttle
Consider using a parser generator for parsing this kind of query, e.g. JavaCC or ANTLR.
As an alternative, if you need more than one level of parentheses, you can create a little string parser for parsing the string character by character.
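Such a character-by-character parser could be sketched as follows. The class name TopLevelSplitter and the method split are assumptions for illustration, and the sketch assumes that the parentheses in the input are balanced. It splits only on commas that are outside any parentheses, at any nesting depth.

```java
import java.util.ArrayList;
import java.util.List;

class TopLevelSplitter {
    /** Splits input on commas that are not enclosed in parentheses. */
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        int depth = 0; // current parenthesis nesting level
        int start = 0; // start index of the part being collected
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(input.substring(start, i));
                start = i + 1;
            }
        }
        parts.add(input.substring(start)); // the last part has no trailing comma
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("C=75,user_is_active(A,B),user_is_using_app(A,B),D=78"));
    }
}
```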
