Regex - Disallow certain characters to appear consecutively - java

I'm not sure if this is possible or not:
Writing a program to convert an infix notation to postfix notation. All is working well so far but trying to implement validation is proving difficult.
I'm trying to use a regex to validate an infix notation, conforming to the following rules:
String must only start with a number or ( (program does not allow negative numbers)
String must only end with number or )
String must only contain 0-9*/()-+
String must not allow following characters to appear together +*/-
I have a regex which conforms to the first 3 rules:
(^[0-9(])([0-9+()*]+)([0-9)]+$)
Is it possible to use regex to implement the last rule?

I will answer only to fourth rule as you have problem only with it.
Yes, there is a possibility, but I think regex is not appropriate tool to check that...
This pattern ^(?(?=.*\+)(?!.*[\*\/-])).+$ will match any string that contain + and not contain other characters: /,*,-. For one character is already lengthy and hard to read. See demo.
It uses conditional expression (?...) to check if lookahead checking for + was successfull, if it is, then negative lookahead assures that you won't have any of \*- characters.
For all characters, the regex will become very big and hard to maintain.
That's why I don't recommend it for this task.

I agree with Michał Turczyn that regex is not the task for this, but not with the reason. It is easy to implement your restrictions. However, your restrictions also allow expressions like (0+3, 2(*4), ((((1, and other things you likely don't want - so regex validation is kind of pointless. If you were writing this with a regex engine with some significant power like PCRE (Perl, PHP) or Onigmo (Ruby), you can fake a parser in regex; but in Java, regex is quite restricted in what it can do. It is enough for the requirements in the question, though:
^[0-9(](?:(?![+*/-][+*/-])[0-9+*/()-])*[0-9)]$
starts with digit or paren
any number of repetitions of any allowed character, such that that character and the next character aren't both operators
ends with digit or thesis.

Related

java 8 regular expression for meta characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
Trying to write a regular expression to check if the sentence as metacharacters "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence i need to identify if metacharacters are present.
\\\\$ or ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters. I don't need to replace just identify if the sentence contains these characters.
If you want to know whether a string contains meta characters, you can use some like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly, then, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c, third, matches checks whether the entire string matches the pattern, rather than contains an occurrences, so you’d need something like .*pattern.* to make matches to behave like find, if you insist on it.
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort, optimizing before you even have an actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.
Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions are sometimes depending on a syntactical subdefinition, like character classes, specific sequences (lookahead, lookbehind,...) and are therefore not intrinsically metacaracters per se. Some are only meta characters in a specific context. However the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. However, this means, that sometimes characters will be matched, in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for it's own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. The last two rows of the citation I had ot leave out, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
// this is the found meta character...
String metaCharacter = metaDetector.group(0);
System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.

Does the Java regex library optimize for any characters .*?

I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.
Pattern pattern = Pattern.compile(regex);
But suppose I used a .* to specify any number of characters. So it's basically a wildcard.
Pattern pattern = Pattern.compile(".*");
Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am doing this because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null I coalesce it to a .*
In your case, I could just use a possessive quantifier to avoid any backtracking:
.*+
The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.
Here is what Cristian Mocanu's writes in his Optimizing regular expressions in Java about a case similar to .*:
Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster then my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).
Some of the hints on Java regex optimization from the same source:
If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.
Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match
If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().
Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).
Whenever you are using negated character classes to match something other than something else, use possessive quantifiers: instead of [^a]*a use [^a]*+a.
Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!
Beware of a known bug #5050507 (when the regex Pattern class throws a StackOverflowError), if you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.
Instead of lazy dot matching, use tempered greedy token (e.g. (?:(?!something).)*) or unrolling the loop techinque (got downvoted for it today, no idea why).
Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

Negate a regex backreference, without using lookarounds

So another question led me to wondering if there is a way to negate a regex backreference without using lookarounds? The original post is here, but the approved answer is using lookarounds. In places those are not supported, is there a clean way to handle negative backreferences? I have found on a regex site [^\x] where x is the backref. number, does not work as intended.
For example:
Find a number followed directly by any other number (but dynamic in the number).
It would make sense to have (\d)[^\1], but inside a character class, everything is taken literally.
is a way to negate a regex backreference without using lookarounds?
No.
In theory, how are the lookarounds being generated by the regex engine?
Lookarounds are not "generated" with other simpler regex constructs, they are a distinct feature of their own.

Java Regular Expression for number of exactly 5 digits anywhere in the string

I'm trying to create a regular expression to parse a 5 digit number out of a string no matter where it is but I can't seem to figure out how to get the beginning and end cases.
I've used the pattern as follows \\d{5} but this will grab a subset of a larger number...however when I try to do something like \\D\\d{5}\\D it doesn't work for the end cases. I would appreciate any help here! Thanks!
For a few examples (55555 is what should be extracted):
At the beginning of the string
"55555blahblahblah123456677788"
In the middle of the string
"2345blahblah:55555blahblah"
At the end of the string
"1234567890blahblahblah55555"
Since you are using a language that supports them use negative lookarounds:
"(?<!\\d)\\d{5}(?!\\d)"
These will assert that your \\d{5} is neither preceded nor followed by a digit. Whether that is due to the edge of the string or a non-digit character does not matter.
Note that these assertions themselves are zero-width matches. So those characters will not actually be included in the match. That is why they are called lookbehind and lookahead. They just check what is there, without actually making it part of the match. This is another disadvantage of using \\D, which would include the non-digit character in your match (or require you to use capturing groups).

Simple regex required

I've never used regexes in my life and by jove it looks like a deep pool to dive into. Anyway,
I need a regex for this pattern (AN is alphanumeric (a-z or 0-9), N is numeric (0-9) and A is alphabetic (a-z)):
AN,AN,AN,AN,AN,N,N,N,N,N,N,AN,AN,AN,A,A
That's five AN's, followed by six N's, followed by three AN's, followed finally by two A's.
If it makes a difference, the language I'm using is Java.
[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}
should work in most RE dialects for the tasks as you specified it -- most of them will also support abbreviations such as \d (digit) in lieu of [0-9] (but if alphabetics need to be lowercase, as you appear to be requesting, you'll probably need to spell out the a-z parts).
Replace each AN by [a-z0-9], each N by [0-9], and each A by [a-z].
30 seconds in Expresso:
[a-zA-Z0-9]{5}[0-9]{6}[a-zA-Z0-9]{3}[0-9]{2}
Case insensitive, but you can probably define that in Java instead of the regex.
For the example you posted, the following should work fine.
(([A-Za-z\d])*,){5}+(([\d])*,){6}+(([A-Za-z\d])*,){3}+([\d])*,[\d]*
In Java you should be able use it like this:
boolean foundMatch = subjectString.matches("(([A-Za-z\\d])*,){5}+(([\\d])*,){6}+(([A-Za-z\\d])*,){3}+([\\d])*,[\\d]*");
I used, this tool to help in learning RegEx, it also make this really easy.
http://www.regexbuddy.com/
Try looking at some simple java regex tutorials such as this
They'll tell you how you form regular expressions and also how to use it in java.
This should match the pattern you request.
[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}
In addition, you could add Beginning of String / End of String matches, if your string match should fail if any other chars are in it:
^[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}$

Categories