Regex not always working with angle brackets - java

While writing a Brainfuck translator in Java, I need to split the string according to the following rule: each of the [ ] , . characters, and each sequence of the + - < > characters, should be followed by a newline. Here's the input string:
..-<[-]>..[[<<[+[-<-->>+,>-.++]-,>,<[.],][<.,<-]+[-,<->,-]<<[>->-.<-[.<++,>++,].-]]]
And my code:
s = s.replaceAll("(\\+|-|<|>)+", "$0\n")
.replaceAll("\\.|\\,|\\[|\\]", "$0\n");
And the result (SO won't allow this here): https://pastebin.com/ZaT8d5ve
What was expected: https://pastebin.com/gNxcgTSP
It seems that places where angle brackets meet plus or minus signs come out wrong, while angle brackets next to square brackets and dot/comma are fine. What's wrong with my solution?

Your code does exactly what you described: any sequence of + - < > characters is followed by \n, so -< becomes -<\n, not -\n<\n.
If I understand you correctly, you want a \n after each sequence of the same character, where that character is one of + - < >. If that is the case, then instead of
s.replaceAll("(\\+|-|<|>)+", "$0\n")
you can use
s.replaceAll("(\\+|-|<|>)\\1*", "$0\n")
\1 is backreference to match from group 1 (here (\\+|-|<|>)), so it matches one of those characters and its optional following repetitions.

You seem to think that
(\\+|-|<|>)+
would match only sequences of identical characters like ++ whereas it also matches any sequence of these characters like -<-->>.
You also don't need two regexes in sequence. The following should do:
s = s.replaceAll("([+<>-])\\1*|[,.\\[\\]]", "$0\n");
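To make the difference concrete, here is a minimal sketch that runs the question's two-step replacement and the single-regex version side by side. The class name, method names, and the short sample input are made up for illustration; only the regexes come from the posts above.

```java
final class BrainfuckSplitDemo {
    // Question's attempt: '+' after the group lets one run mix different characters.
    static String mixedRuns(String s) {
        return s.replaceAll("(\\+|-|<|>)+", "$0\n")
                .replaceAll("\\.|\\,|\\[|\\]", "$0\n");
    }

    // Single pass: a run of one repeated character (via \1*), or one of , . [ ]
    static String sameCharRuns(String s) {
        return s.replaceAll("([+<>-])\\1*|[,.\\[\\]]", "$0\n");
    }

    public static void main(String[] args) {
        String sample = "++--<<[.]"; // hypothetical short input
        System.out.print(mixedRuns(sample));    // "++--<<" stays on one line
        System.out.println("---");
        System.out.print(sameCharRuns(sample)); // "++", "--", "<<" each get their own line
    }
}
```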


how to split a string with a condition in java

I am trying to split a string by a delimiter only in certain situations.
To be more specific, I want to split the conditions of a split statement.
I want to be able to split
"disorder == 1 or ( x < 100)"
into
"disorder == 1"
"(x < 100)"
If I use split("or") I would get a split inside disorder too :
"dis"
"der == 1"
"( x < 100)"
And if I try to use regex like split("[ )]or[( ]") I would lose the parentheses from ( x < 100) :
"disorder == 1"
"x < 100)"
I am looking for a way to split the string only if the delimiter is surrounded by space or parentheses, but I want to keep the surroundings.
You want to use lookaheads and lookbehinds for the spaces/parentheses, so something like this:
String input = "disorder == 1 or( x < 100)";
String[] split = input.split("(?<=[ )])or(?=[ (])");
The [ )] and [ (] character classes match a space or a parenthesis. They can of course be replaced with any other boundary characters, or even with the regex word boundary \\b.
The (?<=...) is a positive lookbehind, so it only matches an or that has a space or ) in front of it, but the split does not remove that character.
The (?=...) is a positive lookahead, so it only matches an or that is followed by a space or (, but the split does not remove that character.
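As a rough, self-contained illustration of the lookaround split, the sketch below wraps it in a helper and trims the pieces. The class and method names are hypothetical, and the trimming step is an addition that the answer itself does not perform.

```java
import java.util.Arrays;

final class LookaroundSplitDemo {
    // Split on "or" only when it sits between a space/')' and a space/'('.
    // The lookarounds are zero-width, so the surrounding characters are kept.
    static String[] splitOnOr(String input) {
        String[] parts = input.split("(?<=[ )])or(?=[ (])");
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].trim(); // assumption: surrounding blanks are not wanted
        }
        return parts;
    }

    public static void main(String[] args) {
        // The "or" inside "disorder" is left alone because it is preceded by 's'.
        System.out.println(Arrays.toString(splitOnOr("disorder == 1 or ( x < 100)")));
        System.out.println(Arrays.toString(splitOnOr("disorder == 1 or( x < 100)")));
    }
}
```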
As flakes pointed out in the comments, you can use the word boundary character.
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
String x = "disorder == 1 or( x < 100)";
for(String s : x.split("\\bor\\b"))
System.out.println(s);
Result:
disorder == 1
( x < 100)
For a solution using lookahead/lookbehind, see Kevin's excellent answer.
I'm not entirely sure what you are doing this for: the example you presented gives only a very small view of what you want to do and why. Correct me if I'm wrong, but it seems that you want to parse arbitrary expressions of some kind of programming language.
In general you can't approach things like this in such a simple way. This is an expression, and it has a hierarchical structure. No simple splitting, not even with RegEx, will work here in general, as RegEx cannot honor this hierarchical structure.
To do this properly you need to parse the expression to some extent. This is done by splitting the expression into simple tokens and rebuilding the hierarchy in a (simple) tree data model; then you can analyze it in any way you want. You can use RegEx to identify the individual tokens, but you need to build a tree-like data structure before you can work with the expression.
Building this tree-like structure is not trivial, as you have to consider the precedence of the various operators within your expression. But iff (!) you have a very specific field of application, e.g. a list of expressions with some very limited structure, you might be able to use the token list directly.
Here's an example for this tokenization process. Your character sequence disorder == 1 or( x < 100) might parse into some token sequence such as this:
W:"disorder"
OP:"=="
NUM:"1"
W:"or"
B:"("
W:"x"
OP:"<"
NUM:"100"
B:")"
Now you can identify the word "or" and deal with the expression the way you want.
The trick then would be to perform reasonable tokenization. For this I recommend defining a set of regular expressions, each one recognizing either a number, a word, or some operator or bracket. Process each string by trying every individual RegEx against the next characters. If one matches, emit a token, then advance to the position right after the match and continue with the rest of the character sequence.
If you manage to pass through the whole character sequence while emitting tokens, parsing has completed successfully. If every individual RegEx fails at some position, there is a syntactical problem in the input data. After tokenization you can do whatever you want with your tokens.
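The tokenization loop described above could be sketched roughly as follows. The token kinds W, OP, NUM and B are taken from the example token list; the concrete regex for each kind, the class name, and the choice to skip whitespace are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenizerSketch {
    // One regex per token kind, tried in insertion order (assumed definitions).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NUM", Pattern.compile("\\d+"));
        RULES.put("W", Pattern.compile("[A-Za-z_]\\w*"));
        RULES.put("OP", Pattern.compile("==|!=|<=|>=|<|>"));
        RULES.put("B", Pattern.compile("[()]"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) {
                pos++; // whitespace only separates tokens
                continue;
            }
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt()) { // must match exactly at the current position
                    tokens.add(rule.getKey() + ":\"" + m.group() + "\"");
                    pos = m.end(); // continue right after the match
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("disorder == 1 or( x < 100)").forEach(System.out::println);
    }
}
```

With the token list in hand, finding the word token "or" is a plain list search instead of a regex problem.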
Looks like you need to have a more complex regular expression where the word "or" plus a single preceding and succeeding character are non alphabetic. For example:
((.+)+(\Wor\W)+)+
Something like this, where you identify the pattern of characters, a separating non-word character, the literal word "or", and another separating non-word character. This may not be the exact form you need, but something similar to this that captures the pattern would probably work for you.
You can just replace the or with a character that does not occur in the string and then split on that character.
For example:
String [] n = input.replace("or(",":(").split(":");

Java Regex with "Joker" characters

I am trying to write a regex that validates an input field.
What I call "joker" chars are '?' and '*'.
Here is my java regex :
"^$|[^\\*\\s]{2,}|[^\\*\\s]{2,}[\\*\\?]|[^\\*\\s]{2,}[\\?]{1,}[^\\s\\*]*[\\*]{0,1}"
What I'm trying to match is:
Minimum 2 alpha-numeric characters (other than '?' and '*')
The '*' can only appear once, and only at the end of the string
The '?' can appear multiple times
No whitespace at all
So for example :
abcd = OK
?bcd = OK
ab?? = OK
ab* = OK
ab?* = OK
??cd = OK
*ab = NOT OK
??? = NOT OK
ab cd = NOT OK
 abcd = NOT OK (space at the beginning)
I've made the regex a bit complicated and I'm lost. Can you help me?
^(?:\?*[a-zA-Z\d]\?*){2,}\*?$
Explanation:
The regex asserts that this pattern must appear twice or more:
\?*[a-zA-Z\d]\?*
which requires one character from the class [a-zA-Z\d], with any number of question marks (including none) on either side of it.
Then the regex matches \*?, which means zero or one asterisk, at the end of the string.
Here is an alternative regex that is faster, as revo suggested in the comments:
^(?:\?*[a-zA-Z\d]){2}[a-zA-Z\d?]*\*?$
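A quick way to check both regexes against the samples from the question is a small harness like the one below. This is only a sketch: the class and field names are invented here, and the patterns are the two from this answer with Java string escaping applied.

```java
import java.util.regex.Pattern;

final class JokerRegexDemo {
    // First version: the "optional ?s, one alphanumeric, optional ?s" unit, twice or more.
    static final Pattern REPEATED_UNIT =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]\\?*){2,}\\*?$");
    // Alternative: force two alphanumerics first, then allow any mix of alphanumerics and ?.
    static final Pattern TWO_THEN_REST =
            Pattern.compile("^(?:\\?*[a-zA-Z\\d]){2}[a-zA-Z\\d?]*\\*?$");

    static boolean isValid(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"abcd", "?bcd", "ab??", "ab*", "ab?*", "??cd",
                            "*ab", "???", "ab cd", " abcd"};
        for (String s : samples) {
            System.out.println("[" + s + "] " + isValid(REPEATED_UNIT, s)
                    + " " + isValid(TWO_THEN_REST, s));
        }
    }
}
```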
Here you go:
^\?*\w{2,}\?*\*?(?<!\s)$
Both are described and demonstrated at Regex101.
^ is a start of the String
\?* indicates any number of initial ? characters (must be escaped)
\w{2,} at least 2 word characters (letters, digits, or underscore)
\?* continues with any number of ? characters
\*? and optionally one last * character
(?<!\s) a negative look-behind asserting that the last character is not whitespace; whitespace elsewhere is already rejected because no other part of the pattern can match it
$ is an end of the String
Another way to solve this problem is the look-ahead mechanism (?=subregex). It is zero-length (it resets the regex cursor to the position it had before executing the subregex), so it lets the regex engine run multiple tests on the same text via the construct
(?=condition1)
(?=condition2)
(?=...)
conditionN
Note: the last condition (conditionN) is not placed in (?=...), so that the regex engine moves the cursor past the tested part (to "consume" it) and can go on to test whatever follows. For that to work, conditionN must match precisely the section we want to "consume" (the earlier conditions don't have that limitation; they can match substrings of any length, for instance just the first few characters).
So now we need to think about what our conditions are.
We want to match only alphanumeric characters, ? and *, where * can appear (optionally) only at the end. We can write that as ^[a-zA-Z0-9?]*[*]?$. This also rejects whitespace characters, because we didn't include them among the accepted characters.
The second requirement is to have "Minimum 2 alpha-numeric characters". It can be written as .*?[a-zA-Z0-9].*?[a-zA-Z0-9] or (?:.*?[a-zA-Z0-9]){2,} (if we like shorter regexes). Since that condition doesn't test the whole text but only some part of it, we can place it in a look-ahead.
These conditions seem to cover everything we wanted, so we can combine them into a regex like:
^(?=(?:.*?[a-zA-Z0-9]){2,})[a-zA-Z0-9?]*[*]?$
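The two conditions can be kept visibly separate when the pattern is built in Java. The following is a sketch under that idea; the constant names and the class name are invented here, and the assembled regex is the one shown above.

```java
import java.util.regex.Pattern;

final class JokerLookaheadDemo {
    // Condition 1 (look-ahead, consumes nothing): at least two alphanumerics somewhere.
    static final String AT_LEAST_TWO_ALNUM = "(?=(?:.*?[a-zA-Z0-9]){2,})";
    // Condition 2 (consuming): only alphanumerics and ?, with an optional final *.
    static final String ALLOWED_SHAPE = "[a-zA-Z0-9?]*[*]?";

    static final Pattern JOKER =
            Pattern.compile("^" + AT_LEAST_TWO_ALNUM + ALLOWED_SHAPE + "$");

    static boolean isValid(String input) {
        return JOKER.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ab?*", "??cd", "*ab", "???", "ab cd"}) {
            System.out.println("[" + s + "] " + isValid(s));
        }
    }
}
```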

Java regex - Replace "x^2" with "x²" but NOT "x^27" with "x²7"

I got a string of an equation where I want to replace all occurrences of the scheme "x^2" with "x²".
My code:
String equation = "x^2";
equation = equation.replace("^2", "\u00B2"); // u00b2 is unicode for '²'
System.out.println(equation);
This works for "x^2", but for "x^25", for example, I'm getting the string "x²5"; in such a case I want it to stay "x^25".
Another example:
"x^2 + 6x" -> "x² + 6x" // ... x squared
"x^28 + 6x" -> "x^28 + 6x" // ... x to the power of 28
Thank you!
EDIT:
The solution from "Mshnik" works perfectly, even with a custom character like "y^2" instead of "x^2", thanks!
Here's a regex that will match the ^2 in x^2 and in a^2+..., but not in x^20:
(?<=\w)\^2(?![0-9.])
Specifically:
(?<= <EXP>) is a positive lookbehind on <EXP>, More explanation here
\w matches any word character: a letter (upper or lower case), a digit, or an underscore.
\^ matches the ^ character literally
2 matches the 2 character literally
(?! <EXP>) is a negative lookahead on <EXP>. More explanation here.
[0-9.] matches a single digit or a decimal point, so a following 0 (as in x^20) or . (as in x^2.5) blocks the match.
Thus all together it matches a ^2 that is preceded by a word character such as x and isn't followed by a digit or a decimal point.
With that, you can use java's Pattern class to find and rebuild a string with the new ². More on that here
Note that in order to get a backslash into a java regex, you need the literal backslash character represented by \\. Thus the final result looks like (?<=\\w)\\^2(?![0-9.]).
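For simple cases, String.replaceAll with the escaped regex may be enough, without driving Pattern and Matcher by hand. The snippet below is a minimal sketch of that shortcut; the class and method names are hypothetical.

```java
final class SquareSuperscriptDemo {
    // "^2" preceded by a word character and not followed by a digit or a dot.
    private static final String SQUARED = "(?<=\\w)\\^2(?![0-9.])";

    static String prettify(String equation) {
        return equation.replaceAll(SQUARED, "\u00B2"); // U+00B2 is the superscript two
    }

    public static void main(String[] args) {
        System.out.println(prettify("x^2 + 6x"));  // exponent 2 is replaced
        System.out.println(prettify("x^28 + 6x")); // exponent 28 is left alone
        System.out.println(prettify("y^2 + x^2.5"));
    }
}
```

One side effect of the [0-9.] lookahead worth knowing: a ^2 directly followed by a sentence-ending period is also left unchanged.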

Regex expression to validate a formula

I am new to regex and am currently building a web application in Java. I have the following requirements to validate a formula:
Formula must start with “T”
A formula can contain the following set of characters:
Digit: 0 - 9
Alpha: A - Z
Operators: *, /, +, -
Separator: ;
An operator must always be followed by a digit
The character “T” must always be followed by a digit or an alpha.
The separator must always be followed by “T”.
The character “M” must always be followed by an operator.
I managed to build the expression shown below:
^[T][A-Z0-9 -- \\+*;]*
But I don't know how to add the following validations to the regex above:
An operator must always be followed by a digit
The character “T” must always be followed by a digit or an alpha.
The separator must always be followed by “T”
The character “M” must always be followed by an operator.
Valid sample: TA123;T1*2/32M+
Invalid Sample: T+qMg;Y
^(?!.*[*+/-]\\D)(?!.*T\\W)(?!.*[;:][^T])(?!.*M[^*+/-])[T][A-Z0-9 +/*;:-]*$
You can use this. See demo.
https://regex101.com/r/sS2dM8/7
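To see how the negative lookaheads map onto the rules, the regex can be assembled piece by piece. This is a sketch only: the class name and the extra test strings "TM1" and "T1;A" are made up here, while the two samples come from the question.

```java
import java.util.regex.Pattern;

final class FormulaValidatorDemo {
    static final Pattern FORMULA = Pattern.compile(
            "^"
            + "(?!.*[*+/-]\\D)"      // no operator followed by a non-digit
            + "(?!.*T\\W)"           // no T followed by a non-word character
            + "(?!.*[;:][^T])"       // no separator followed by anything but T
            + "(?!.*M[^*+/-])"       // no M followed by anything but an operator
            + "[T][A-Z0-9 +/*;:-]*"  // starts with T, then only allowed characters
            + "$");

    static boolean isValid(String formula) {
        return FORMULA.matcher(formula).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("TA123;T1*2/32M+")); // valid sample from the question
        System.out.println(isValid("T+qMg;Y"));         // invalid sample from the question
        System.out.println(isValid("TM1"));             // M followed by a digit
        System.out.println(isValid("T1;A"));            // separator not followed by T
    }
}
```

Each lookahead only rejects a forbidden pair when a next character exists, so an operator or M at the very end of the formula passes, as the valid sample ending in M+ requires.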
We lack a bit of information to fully understand what you want. A couple examples would help.
For now, a small regexp :
^(T[A-LN-Z0-9]*M[+\-/*][0-9];?)*
EDIT :
From my understanding, this should be close to what you're looking for :
^(T([A-LN-Z0-9]*M?[+\-/*]?[0-9]?)*;?)+
https://regex101.com/r/hT7aP2/1
This regexp forces the line to begin with a T, followed by zero or more characters from the [A-LN-Z0-9] range, meaning all your alphas and digits except M.
Then it needs an M followed by an operator from the class [+\-/*] (pretty much +, -, / and *, except that - is special inside a character class, and / is special in some regex flavors, so they are escaped to tell the regexp that we want the characters themselves and not their special meaning).
Then it continues with a digit, and ends with a ";" that might or might not be there.
Everything in the parentheses can be repeated zero or more times.
I would have liked examples of what you want to validate. For example, we don't know whether the line has to end with a ";".
Depending on what you want, splitting the string on the ";" character and validating each of the resulting strings with that regexp might work.

Regex - starts with OPERATION and must be followed by either an integer or double

So I have the following input string:
OPERATION: 12, 12.32, 54.3332
OPERATION can be any of MIN, MAX, SUM. The regex should accept only strings that start with one of those three words, followed by a colon and then by integers or doubles.
I have been googling and fiddling around for hours now before I decided to turn to you guys.
Thanks for your time!
EDIT:
So far I have ended with this regex
^[MAX:SUM:AVERAGE:MIN:(\\d+(\\.\\d+)?), ]+$
It matches "MAX: " as correct, as well as "MAX:" and "MA: ". It also matches strings of the following format: "MAX: 12, 12.3323......"
You misunderstand the meaning of []. These refer to single character alternatives.
So [MAX:SUM:AVERAGE:MIN:(\\d+(\\.\\d+)?), ] is either M or A or X or .... So something like MMMMM should also match.
You may want something more like this:
^(MAX|SUM|AVERAGE|MIN): (\\d+(\\.\\d+)?(, (?=.)|$))+$
Explanation:
(MAX|SUM|AVERAGE|MIN) is either MAX, SUM, AVERAGE or MIN.
": " refers to the actual characters : and space.
\\d+(\\.\\d+)? is what you had before.
", " refers to the actual characters , and space.
(?=.) is look-ahead. Checking that the following characters (?=) matches any single character (.) (thus not end-of-string) so there isn't a ", " required at the end.
So , (?=.)|$ is either ", " not at the end or nothing at the end.
Alternative: Not using look-ahead
^(MAX|SUM|AVERAGE|MIN): (\\d+(\\.\\d+)?(, \\d+(\\.\\d+)?)*)$
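Both variants can be tried against the strings mentioned in the question with a small harness such as this one. It is a sketch; the class and field names are assumptions, and the patterns are copied from the two regexes above.

```java
import java.util.regex.Pattern;

final class OperationRegexDemo {
    // Variant with look-ahead: ", " is only allowed when something follows it.
    static final Pattern WITH_LOOKAHEAD = Pattern.compile(
            "^(MAX|SUM|AVERAGE|MIN): (\\d+(\\.\\d+)?(, (?=.)|$))+$");
    // Variant without look-ahead: first number, then zero or more ", number" groups.
    static final Pattern WITHOUT_LOOKAHEAD = Pattern.compile(
            "^(MAX|SUM|AVERAGE|MIN): (\\d+(\\.\\d+)?(, \\d+(\\.\\d+)?)*)$");

    static boolean accepts(Pattern pattern, String line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {"MAX: 12, 12.32, 54.3332", "MAX: ", "MAX:", "MA: 12",
                          "MMMMM", "MAX: 12, "};
        for (String line : lines) {
            System.out.println("[" + line + "] " + accepts(WITH_LOOKAHEAD, line)
                    + " " + accepts(WITHOUT_LOOKAHEAD, line));
        }
    }
}
```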
(?:MIN|MAX|SUM):[ ]+\d+(?:[.]\d+)?(?:,[ ]+\d+(?:[.]\d+)?)*
Explanation
(?:MIN|MAX|SUM) - either one of the operations, non-capturing
: - a literal :
[ ]+ - 1 or more space characters
\d+ - 1 or more digits
(?:[.]\d+)? - optionally a literal . followed by 1 or more digits, non-capturing
(?:,...)* - a literal comma and another number, 0 or more times, non-capturing
[.] could also be written as \. but putting metacharacters in a character class can be easier to read, especially in Java string literals where every backslash has to be doubled; the same goes for using [ ] instead of a bare space.
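Once a line has been validated, the numbers still have to be pulled out. One possible approach, sketched below, reuses the number sub-pattern with Matcher.find(); the class name, method name, and the decision to return a List<Double> are assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OperationParserDemo {
    private static final Pattern LINE = Pattern.compile(
            "(?:MIN|MAX|SUM):[ ]+\\d+(?:[.]\\d+)?(?:,[ ]+\\d+(?:[.]\\d+)?)*");
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.]\\d+)?");

    // Validates the whole line first, then collects every number in it.
    static List<Double> operands(String line) {
        if (!LINE.matcher(line).matches()) {
            throw new IllegalArgumentException("Not a valid operation line: " + line);
        }
        List<Double> values = new ArrayList<>();
        Matcher m = NUMBER.matcher(line);
        while (m.find()) { // operation names contain no digits, so only operands match
            values.add(Double.parseDouble(m.group()));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(operands("SUM: 12, 12.32, 54.3332"));
    }
}
```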
References
Java Patterns
Hope that helps.
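java.util.regex.Pattern lets you build a regular expression out of capturing groups. You can use groups to divide your input string.
OPERATION: NUM, NUM, DECIMAL, DECIMAL
Your regular expression is:
(MIN|MAX|SUM):([0-9]{4}\.[0-9]{2}),(...),(...),(...)
and use:
Matcher.group(int)
to read the groups that were found.
Note: I assumed that your number format is "dddd.dd". The operation names have to be written as the alternation (MIN|MAX|SUM), because a character class such as [MIN,MAX,SUM] matches single characters, and the dot has to be escaped to match a literal decimal point.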
