Regex Pattern Priority - java

I have set the pattern Part\s[V|IV|III|II|I][:]? the problem is that when it comes across Part III, for example, then it acknowledges Part I as the match and cuts out the other two II's. Is there no way to make the III in the pattern have a higher priority so it's included in what's found?

The issue you have is that a character class (the [ and ]) only matches a single character. So it matches any one of the characters within it, in this case you've simply restated I quite a few times. What you need is to or multiple strings rather than characters. I.e.
(Part\s(V|IV|III|II|I)[:]?)
Note the round brackets instead of square ones.

You may use (Part\s(V|IV|III|II|I)[:]?)

Related

java 8 regular expression for meta characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
Trying to write a regular expression to check if the sentence as metacharacters "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence i need to identify if metacharacters are present.
\\\\$ or ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters. I don't need to replace just identify if the sentence contains these characters.
If you want to know whether a string contains meta characters, you can use some like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly, then, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c, third, matches checks whether the entire string matches the pattern, rather than contains an occurrences, so you’d need something like .*pattern.* to make matches to behave like find, if you insist on it.
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort, optimizing before you even have an actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.
Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions are sometimes depending on a syntactical subdefinition, like character classes, specific sequences (lookahead, lookbehind,...) and are therefore not intrinsically metacaracters per se. Some are only meta characters in a specific context. However the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. However, this means, that sometimes characters will be matched, in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for it's own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. The last two rows of the citation I had ot leave out, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
// this is the found meta character...
String metaCharacter = metaDetector.group(0);
System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.

Regex for multiple occurrences of specific words

Hello
I'm trying to create a validation rule that checks the regular expression to accept only specific phrases. Regex is based on Java.
Here are examples of correct inputs:
1OR2
2
1 OR 2 OR 15
( 2OR3) AND 1
(12AND13 AND1)OR(4 AND5)
((2AND3 AND 1)OR(4AND5))AND6
but I would be happy if only the regex could accept anything like :
())34AND(4
I have no idea how to create a regex to check if the brackets open and close correctly(they can be nested). I assumed it can be impossible to check it in regex so the proper validation for the brackets I've already made in the code(stack implementation). In the code I have a second step validation of the phrase.
All I need the regex to do is to check if there are these specific things inside the phrase:
numbers, round brackets, words AND and OR with multiple occurrences and whitespaces are allowed.
It should NOT accept letters or other characters.
All I managed to create so far is this:
^[0-9 \\(][0-9 \\(\\)]*
also tried adding something like:
\\b(AND|OR)\\b
inside the second pair of brackets but with no luck.
I cannot figure out how to correct it to add OR and AND words.
I used the following and matched all the inputs you gave:
^[^\)][0-9 \( (AND|OR)]*$
I assumed you didn't want to start with ), which is why I included ^[^\)].
In case you weren't aware, I use https://www.regexpal.com to check my regular expressions for code.
Since you have an arbitrary number of nested elements it's arguably not possible with regex.
For demonstration purposes only, this matches zero or more conjunctions and one set of parenthesis:
^\d+(\s*(?:AND|OR)\s*(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\)))*$|^(\d+|\(?\s*\d+(\s*(?:AND|OR)\s*\d+)\s*\))\s*(\s*(?:AND|OR)\s*\d+)*$
That's it. Adding more sets and levels of nested parenthesis leads to exponentially increasing complexity - till it breaks altogether.
Demo

Matches A but not B in Java Regex?

I have a big document. Lets scale it to
location=State-City-House
location=City-House
So What I want to do is replace all those not starting with State, with some other string. Say "NY". But those starting with State must remain untouched.
So my end result would be
location=State-City-House
location=NY-City-House
1.Obviously I cant use String.replaceAll().
2.Using Pattern.matcher() is tricky since we are using two different patterns where one must be found and one must not be found.
3.Tried a dirty way of replacing "location=State" first with "bocation=State" then replacing the others and then re-replacing.
So, A neat and simple way to do it?
You can definitely use replaceAll with a negative lookahead:
String repl = input.replaceAll( "(?m)^(location=)(?!State)", "$1NY-" );
(?m) sets MULTILINE modifier so that we match anchors ^ and $ in each line
(location=) matches location= and captures the value in group #1
(?!State) is the negative lookahead to fail the match when State appears after the captured group #1 i.e. location=
In replacement we use $1NY- to make it location=NY- at start.
RegEx Demo
If I understand your intention correctly, you don't actually have the string "State" in your input, but varying strings that represent states.
But some of your text lines are missing the state altogether and only have the name of the City and House. Is that correct? In that case, the defining characteristic between the 2 kinds of lines is the number of dashes.
^location=([^-]+)-([^-]+)$
The above regex matches only full lines with only 1 dash.
I might have misunderstood the task. It would be easier if you would post some of the actual input.

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that the string can be whatever things, except nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)), I think the *? is to non-greedily match any number of sequence of not(string), and use the lookahead to stop the match that is not not(string), so that I can extract only the part that I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
the group(0) returns the whole string, which I only need $(string).not(string).not(string).
Obviously I still miss something or misuse of anything, any suggestions?
Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for java though)
If we assume that parenthesis are not nested, you can write something like this:
string p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
Not need to add a lookahead since the non-capturing group has a greedy quantifier (so the group is repeated as possible).
if what you called string in your question may be a quoted string with parenthesis inside like in Pshemo example: $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
From here, "group zero denotes the entire pattern". Use group(1).
(\$\([\w ]+\))(\.not\([\w ]+\))*
This will also work, it would give you two groups, One consisting of the word with $ sign, another would give you the set of all ".not" strings.
Please note: You might have to add escape characters for java.

How to bound +/* for a regex group?

Say I have the regex:
(CC|NP)*
As such it creates problems in look-before regexes in Java. How shall I write it to avoid those problem?
I thought of re-writing it as:
(CC|NP){1,9}
Testing on regexr it seems like the upperbound is ignored completely.
In Java those quantitiers {} seem to work only on non-group regex elements as in:
\w+\[\S{1,9}\]
Sorry, look behind patterns usually have restrictions on the sub pattern. See f.x. Why doesn't finite repetition in lookbehind work in some flavors?p. Or search for "lookbehind pattern restrictions" on the web.
You may try to write down all fixed length variants of the look behind pattern as alternating pattern. But this might be many...
You may also simulate lookbehind by normally matching the inner pattern and match and group your actual target: (?:CC|NP)*(.*)
I'm not sure of where you percieve the problem. Quantifiers act on groups just like any entity.
So, \w+\[\S{1,9}\] could have been written \w+\[(\S){1,9}\] with the same result.
As far as your example on regexr, nothing is broken there. It matches what it's supposed to.
(PUN|CC|NP){1,3} will greedily try to match any of the alternations (in left-to-right priority). There will be no breaks in what it will match. It matches 1-3 consecutive occurances of PUN or CC or NP.
The sample string you provided had a space between CC's, so since a space does not exist in the regex, it is not matched. The only thing that is matching is a single CC.
If you want to account for a space, it can be added to the grouping like this:
(?:(?:PUN|CC|NP)\s*){1,3}
If you want to only allow spaces between the alternation's, it can be done like this:
(?:PUN|CC|NP)(?:\s*(?:PUN|CC|NP)){0,2}

Categories