How to use java regex to get text between brackets - java

So I know this question may appear similar to other questions out there regarding regex and such. I believe mine is unique because I'm using java to parse some javascript, which can contain brackets within brackets for anonymous functions etc. Consider the following as an example:
describe('a jasmine describe', function (){
it('login', function(){
//some function stuff
});
it('another it statement', function() {
//some additional stuff
});
});
What I ultimately want is:
Group 1: "a jasmine describe"
Group 2: all of the content between open/close brackets of the describe
I believe I have the regex to get the Group 1 I'm looking for which is:
Pattern r = Pattern.compile("(?:describe\\s*\\(\\s*')(.*?)(?=')", Pattern.CASE_INSENSITIVE);
But I have no idea how to get the contents between the open/close of the specific describe bracket.

Regex may not be the best tool for that, but you can try this regex:
^(?m)(?<indent>\s*)describe\('([^']+)'[^{]+\{([\s\S]+?)\n\k<indent>\}\);
^(?m) - beginning of a line, in multiline mode (could be replaced by passing Pattern.MULTILINE),
(?<indent>\s*) - capture the indentation before the method,
describe\( - describe followed by an opening parenthesis,
'([^']+)' - match the text between single quotes; needs to be modified if the text can contain ',
[^{]+\{ - match text up to the first {,
([\s\S]+?) - match anything, with a reluctant quantifier,
\n\k<indent>\}\); - a new line, followed by the captured indentation, followed by the closing of the method body,
which will capture 'a jasmine describe' in the 2nd group and the describe content in the 3rd group. The numbering is shifted by the additional named group indent (the 1st group), which is what ensures that the regex matches the whole content of {...}. The 1st group (<indent>) captures the indentation before the describe function in the code and then uses it as a boundary for where to finish matching (on a } preceded by the same indentation). This is a kind of workaround for matching nested brackets, but the code needs to be well formatted.
Of course, in Java code you need to double the \ backslashes.
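A minimal Java sketch of the indentation approach described above. It assumes the source is consistently indented (the sample input here is the question's snippet with indentation added), and it passes Pattern.MULTILINE instead of the inline (?m) flag. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DescribeIndentDemo {
    // The pattern from the answer with backslashes doubled for a Java string.
    // Pattern.MULTILINE stands in for the inline (?m) flag.
    static final Pattern DESCRIBE = Pattern.compile(
        "^(?<indent>\\s*)describe\\('([^']+)'[^{]+\\{([\\s\\S]+?)\\n\\k<indent>\\}\\);",
        Pattern.MULTILINE);

    // Assumed sample input: the question's snippet, indented consistently.
    static final String SAMPLE = String.join("\n",
        "describe('a jasmine describe', function (){",
        "  it('login', function(){",
        "    //some function stuff",
        "  });",
        "  it('another it statement', function() {",
        "    //some additional stuff",
        "  });",
        "});");

    // Returns {title, body} of the first describe block, or null when nothing matches.
    static String[] firstDescribe(String source) {
        Matcher m = DESCRIBE.matcher(source);
        return m.find() ? new String[] { m.group(2), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String[] parts = firstDescribe(SAMPLE);
        System.out.println("title: " + parts[0]);
        System.out.println("body:" + parts[1]);
    }
}
```

If the input is not indented, the first line that starts with }); ends the match, so the body would be cut short after the first it block.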

This regex matches your target capturing groups 1 and 2 as required:
describe\('([^']*).*?function\s*\(\)\s*\{(([^{]*\{[^}]*\})*[^}]*)\}
This will handle any number of non-nested curly-bracketed input in the body of the function.
See live demo.
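As a rough Java sketch of the pattern above (backslashes doubled for the string literal; class and method names are made up for illustration), run against the unindented snippet from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DescribeBraceDemo {
    // Group 1 is the describe title, group 2 the function body.
    // The body may contain any number of {...} blocks, but none nested inside another.
    static final Pattern DESCRIBE = Pattern.compile(
        "describe\\('([^']*).*?function\\s*\\(\\)\\s*\\{(([^{]*\\{[^}]*\\})*[^}]*)\\}");

    // The snippet from the question, without indentation.
    static final String SAMPLE = String.join("\n",
        "describe('a jasmine describe', function (){",
        "it('login', function(){",
        "//some function stuff",
        "});",
        "it('another it statement', function() {",
        "//some additional stuff",
        "});",
        "});");

    // Returns {title, body} of the first describe block, or null when nothing matches.
    static String[] firstDescribe(String source) {
        Matcher m = DESCRIBE.matcher(source);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = firstDescribe(SAMPLE);
        System.out.println("title: " + parts[0]);
        System.out.println("body:" + parts[1]);
    }
}
```

Unlike the indentation-based pattern, this one does not care about formatting, but a brace nested inside one of the it bodies would break the pairing.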

Related

Using regular expression, how to remove matching sequence at the beginning and ending of the text but keeping what's in the middle?

my problem is very simple but I can't figure out the correct regular expression I should use.
I have the following variable (Java) :
String text = "\033[1mYO\033[0m"; // this is ANSI for bold text in the Terminal
My goal is to remove the ANSI codes with a single regular expression (I just want to keep the plain text at the middle). I cannot modify the text in any way and those ANSI codes will always be at the same place (so one at the beginning, one at the end, though sometimes it's possible that there is none).
With this regular expression, I will remove them using replaceAll method :
String plainText = text.replaceAll(unknownRegex, "");
Any idea on what the unknown regex could be?
Well, you use a single regex that has the ansi codes optionally at the beginning and end, captures anything in between and replaces the entire string with the value of the group: text.replaceAll("^(?:\\\\\\d+\\[1m)?(.*?)(?:\\\\\\d+\\[0m)?$", "$1"). (this might not capture every ansi code - adjust if needed).
Breaking the expression down (note that the example above escapes backslashes for Java strings so they are doubled):
^ is the start of the string
(?:\\\d+\[1m)? matches an optional \<at least 1 digit>[1m
(.*?) matches any text but as little as possible, and captures it into group 1
(?:\\\d+\[0m)? matches an optional \<at least 1 digit>[0m
$ is the end of the input
In the replacement $1 refers to the value of capturing group 1 which is (.*?) in the expression.
Found the answer thanks to a comment that disappeared.
Actually, I just need to make a group to get what's in the middle of the string and use it ($1) to replace the whole thing:
String plainText = text.replaceAll("\\033\\[.*m(.+)\\033\\[.*m", "$1");
Not sure if this will remove every ANSI code, but that is enough for what I want to do.
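One detail worth keeping in mind: in a Java string literal, \033 is an octal escape for the single ESC character (0x1B), so the regex has to match that character (written \033 or \x1B in the pattern) rather than a backslash followed by digits. A hedged sketch that removes each colour/style sequence wherever it appears, instead of capturing the middle; the class and method names are made up, and it only covers sequences of the form ESC [ digits/semicolons m:

```java
class AnsiStripDemo {
    // ESC, then '[', then any digits or semicolons, then 'm'.
    // Other kinds of ANSI escape sequences are not covered by this sketch.
    static final String SGR_SEQUENCE = "\\x1B\\[[0-9;]*m";

    // Removes every matching sequence; text without any is returned unchanged.
    static String stripAnsi(String text) {
        return text.replaceAll(SGR_SEQUENCE, "");
    }

    public static void main(String[] args) {
        String text = "\033[1mYO\033[0m"; // bold "YO" in a terminal
        System.out.println(stripAnsi(text));
    }
}
```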

Regexp captured group backreference doesn't work [duplicate]

I have a regex that I use to match Expression of the form (val1 operator val2)
This regex looks like :
(\(\s*([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)\s*([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*\))
Which is actually good and matches what I want as you can see here in this demo
BUT :D (here comes the butter)
I want to optimise the regex itself by making it more readable and "Compact". I searched on how to do that and I found something called back-reference, in which you can name your capturing groups and then reference them later as such:
(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)\s*(\g{Val})\s*\))
where I named the group that captures the left side of the expression Val and later referenced it as (\g{Val}). Now the problem is that this expression, as you can see here, only matches the case where the left side of the expression is exactly the same as the right side, e.g. (a==a) or (1==1), and does not match expressions such as (a==b)!
Now the question is: is there a way to reference the pattern instead of the matched value?!
Note that \g{N} is equivalent to \1, that is, a backreference that matches the same value, not the pattern, that the corresponding capturing group matched. This syntax is a bit more flexible though, since you can define the capture groups that are relative to the current group by using - before the number (i.e. \g{-2}, (\p{L})(\d)\g{-2} will match a1a).
The PCRE engine allows subroutine calls to recurse subpatterns. To repeat the pattern of Group 1, use (?1), and (?&Val) to recurse the pattern of the named group Val.
Also, you may use character classes to match single characters, and consider using ? quantifier to make parts of the regex optional:
(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|[*\/+-]|[=!><]=|[><])\s*((?&Val))\s*\))
See the regex demo
Note that \'.*\' and \[.*\] can match too much, consider replacing with \'[^\']*\' and \[[^][]*\].
What language/application are you using this regular expression in?
If you have the option you can specify the different parts as named variables and then build the final regular expression by combining them.
val = "([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])"
op = "(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)"
exp = "(\(" .. val .. "\s*" .. op .. "\s*" .. val .. "\))"
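If the target turns out to be Java, whose java.util.regex has no (?&Val) or (?1) subroutine calls, the same idea of building the pattern from named parts could look roughly like this. It uses the tightened '[^']*' and bracket alternatives suggested in the other answer; the constant and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExpressionPatternDemo {
    // One operand: identifier, number, quoted string, or bracketed list.
    static final String VAL = "(?:[a-zA-Z]+[0-9]*|[0-9]+|'[^']*'|\\[[^\\[\\]]*\\])";
    // One operator.
    static final String OP = "(?:ni|in|[*/+-]|[=!><]=|[><])";
    // The operand sub-pattern is reused on both sides by plain string concatenation.
    static final Pattern EXPR = Pattern.compile(
        "\\(\\s*(" + VAL + ")\\s*(" + OP + ")\\s*(" + VAL + ")\\s*\\)");

    // Returns {left, operator, right}, or null if the whole input is not an expression.
    static String[] parse(String input) {
        Matcher m = EXPR.matcher(input);
        return m.matches() ? new String[] { m.group(1), m.group(2), m.group(3) } : null;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", parse("(a==b)")));
    }
}
```

Because the pattern text is repeated rather than back-referenced, the left and right operands can differ.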

Matching the last group of something in Java

I am having trouble defining a regular expression (for a Java program) that gives me the last matching group of something. The reason for that is the conversion of some text files (here: the export of a wiki) to the format of a new wiki.
For example, when I have the following text:
Here another include: [[Include(a/a-1)]]
The hierarchy of the pages is:
/a
/a-1
The old wiki referenced the hierarchy name, the new wiki will only have the title of the page. The new format should look like:
{include:a-1}
Currently I have the following regular expression:
/\[\[Include\(([^\)]+)\)\]\]/
which matches from the example above a/a-1, but I need a regular expression that matches only a-1.
Is it possible to construct a regular expression for java that matches the last group only?
So for the following original lines:
[[Include(a)]]
[[Include(a/b)]]
[[Include(a/a-1)]]
[[Include(a/a-1/a-2)]]
I would like to match only
a
b
a-1
a-2
This is the regex you're looking for. Group 1 has the text you want, see the captures pane at the bottom right of the demo, as well as the Substitutions pane at the bottom.
EDIT: per your request, replaced the [a-z0-9-] with [^/] (Did not update the regex101 demo as this regex, which I confirmed to work, breaks in regex101, which uses / as a delimiter, even when escaping the /. However here is another demo on regexplanet)
Search:
\[\[Include\((?:[^/]+\/)*([^/]+)\)\]\]
Replace:
{include:$1}
How does it work?
After the opening bracket of the Include, we match a run of non-slash characters such as a (originally letters, dashes and digits only) followed by a forward slash, zero or more times, then we capture the last such run of characters.
In the few languages that support infinite-width lookbehinds, we could match what you want without relying on Group 1 captures.
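A small Java sketch of the search and replace above (class and method names are made up for illustration). It is applied one line at a time here, because [^/] also matches ) and newlines, so two includes in the same input could end up merged into one match.

```java
import java.util.regex.Pattern;

class IncludeRewriteDemo {
    // Skips any leading "segment/" parts and captures the last path segment in group 1.
    static final Pattern INCLUDE = Pattern.compile(
        "\\[\\[Include\\((?:[^/]+/)*([^/]+)\\)\\]\\]");

    // Rewrites an old-style include on a single line to the new format.
    static String rewriteLine(String line) {
        return INCLUDE.matcher(line).replaceAll("{include:$1}");
    }

    public static void main(String[] args) {
        String[] lines = {
            "[[Include(a)]]",
            "[[Include(a/b)]]",
            "[[Include(a/a-1)]]",
            "[[Include(a/a-1/a-2)]]"
        };
        for (String line : lines) {
            System.out.println(rewriteLine(line));
        }
    }
}
```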

Match text with possible brackets between brackets

I need to match text between ${ and }
Example:
${I need to match this text}
Simple regex \$\\{(.+?)\\} will work fine until I place a } inside the text.
Curly brackets inside the text to match are always paired.
Is there any possibility to solve this by means of Regular Expressions?
\$\{((?:\{[^\{\}]*\}|[^\{\}]*)*)\}
If we meet an opening bracket, we look for its pair, and after the closing one we proceed as usual. This can't handle more than one level of nested brackets.
The main building block here is [^\{\}]* - any non-bracket sequence. It can be surrounded by brackets, \{[^\{\}]*\}, but it might not be, hence (?:\{[^\{\}]*\}|[^\{\}]*). Any number of these sequences can be present, hence the * at the end.
Arbitrary levels of nesting would require a recursive regex, which Java does not support. But any fixed depth can be matched by carefully extending this idea.
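A short Java sketch of the one-level pattern above, with backslashes doubled for the string literal. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OneLevelNestingDemo {
    // ${ ... } where the content is any mix of brace-free runs and
    // single-level {...} groups.
    static final Pattern PLACEHOLDER = Pattern.compile(
        "\\$\\{((?:\\{[^\\{\\}]*\\}|[^\\{\\}]*)*)\\}");

    // Returns the content of the first ${...} in the text, or null if there is none.
    static String firstPlaceholder(String text) {
        Matcher m = PLACEHOLDER.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstPlaceholder("${I need to match this text}"));
        System.out.println(firstPlaceholder("x ${a {b} c} y"));
    }
}
```

With two levels of nesting, as in ${a {b {c} d} e}, this pattern finds no match at all rather than a truncated one.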
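Add a $ to the end of the regex and don't escape it. An unescaped dollar sign anchors the match at the very end of the input, so the reluctant (.+?) has to extend up to the last }. This only works when the ${...} expression is the last thing in the input.
ReGex: \${(.+?)}$
Java Formatted: \\$\\{(.+?)\\}$ (the braces are escaped here because Java rejects an unescaped { at that position)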

RegEx - match the whole <a> tag in java

I'm trying to match this <a href="**something**"> using regex in java using this code:
Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string= matcher.replaceAll("");
I'm not really familiar with regex. What am I doing wrong? Thanks
If you just want to find the start tag you could use:
"<a(?=[>\\s])[^>]*>"
If you are trying to get the href attribute it would be better to use:
"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"
This would capture the link into capturing group 2.
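Both patterns can be tried out with a small Java sketch like the one below. The class and method names are made up for illustration, and like any regex over HTML it only holds up for reasonably regular markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorTagDemo {
    // <a followed by whitespace or >, so <abbr> and similar tags are left alone.
    static final Pattern START_TAG = Pattern.compile("<a(?=[>\\s])[^>]*>");
    // Group 1 is the quote character, group 2 the href value between the quotes.
    static final Pattern HREF = Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    // Removes every <a ...> start tag; closing tags are left in place.
    static String removeStartTags(String html) {
        return START_TAG.matcher(html).replaceAll("");
    }

    // Returns the href of the first <a> tag that has one, or null.
    static String firstHref(String html) {
        Matcher m = HREF.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/page\">link</a>";
        System.out.println(removeStartTags(html));
        System.out.println(firstHref(html));
    }
}
```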
To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:
<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>
It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.
All of HTML, on the other hand, is not "regular" and so you can't do it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)
Here's a breakdown of what the above expression does:
<\s* < and possibly some spaces
a "a"
(?: 0 or more...
\s+ some spaces
[a-z]+ attribute name (simplified)
(?: and maybe...
\s*=\s* an equal sign, possibly with surrounding spaces
(?: and one of:
[a-z0-9]+ - a simple attribute value (simplified)
|"[^"]*" - a double-quoted attr value
|'[^']*' - a single-quoted attr value
)
)?
)*
\s*> possibly more spaces and then >
(The comments at the start of each group also talk about the operator at
the end of the group, or even in the group.)
There are possibly other simplifications here -- I wrote this from
memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.
you can just match against:
"<a[^>]*>"
If the * is "greedy" in Java (which I think it is), this is correct.
But you cannot match < a whatever="foo" > with that, because of the whitespace.
The following is better, though more complicated to understand:
"<\\s*a\\s+[^>]*>"
(The double \\ is needed because \ is a special char in Java strings)
This handles optional whitespace before the a and at least one whitespace character after it.
So you don't match <abcdef>, which is not a correct a tag.
(I assume your a tag stands isolated in one line and you are not working with multiline mode enabled. Otherwise it gets far, far more complicated.)
Your last *[^/]*?> seems a little bit strange, maybe it doesn't work because of that.
Ok, let's check what you are doing:
<([a-z]+) *[^/]*?>
<([a-z]+)
match a < followed by [a-z] at least one time. The letters are grouped by the brackets.
Now you use a space followed by *, which means the space may appear multiple times, or not at all.
[^/]*
This means now match everything but a /, or nothing (because of the *)
The question mark makes the preceding * reluctant, so it matches as few characters as possible.
Last char > is matched as the last element, which must appear.
To sum up, your expression cannot match a tag that contains a / anywhere before the >, which is the case for most href values :)
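To see how the pattern from the question behaves, here is a small Java sketch (class and method names are made up for illustration) that applies it to one tag with a slash in the href and one without:

```java
class OriginalTagPatternDemo {
    // The pattern from the question, unchanged.
    static final String ORIGINAL = "<([a-z]+) *[^/]*?>";

    // Removes whatever the original pattern matches.
    static String strip(String html) {
        return html.replaceAll(ORIGINAL, "");
    }

    public static void main(String[] args) {
        // No slash before the first '>': the start tag is removed.
        System.out.println(strip("<a href=\"page.html\">link</a>"));
        // A slash inside the href: [^/] cannot get past it, so nothing is removed.
        System.out.println(strip("<a href=\"http://example.com\">link</a>"));
    }
}
```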
Take a look at: http://www.regular-expressions.info/
This is a good starting point.
