Match text between characters (avoid nesting) - java

Given:
"abc{defghij{kl}mnopqrst}uvwxyz{aaaaaaaaaa}"
I want to match the text between the characters { and the last } excluding nesting - i.e. the texts {defghij{kl}mnopqrst} and {aaaaaaaaaa}.
Without the nested {kl}, the regex expression \{[^{}]*\} works just fine. But not with the nested {kl}.
Is there a way to do this? If not, can I say "match text between { and } where the enclosed text is at least, e.g., 3 characters long", so that the nested {kl}, which contains only two characters, is not matched? (I'm assuming one level of nesting.)
Editor: https://www.freeformatter.com/java-regex-tester.html
Thanks,

Since the nesting in your problem never goes deeper than one level, a short, readable regex can handle it:
\{(?:\{[^{}]*}|[^{}]+)*}
In Java you have to escape opening braces, as I did.
The regex above matches an opening brace, then repeatedly matches either a run of characters other than { and } (i.e. [^{}]+) or a brace-enclosed group with no further nesting (\{[^{}]*}), as many times as possible, and finally expects a closing brace.
See live demo here
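As a minimal sketch of how the pattern above could be used from Java (the class and method names are made up for illustration, and the input is the string from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OneLevelBraces {
    // An outer {...} whose content is plain text and/or one level of nested {...}.
    private static final Pattern BLOCK =
            Pattern.compile("\\{(?:\\{[^{}]*}|[^{}]+)*}");

    /** Collects every top-level brace block found in the input. */
    static List<String> findBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String input = "abc{defghij{kl}mnopqrst}uvwxyz{aaaaaaaaaa}";
        // Prints {defghij{kl}mnopqrst} and {aaaaaaaaaa}, one per line.
        findBlocks(input).forEach(System.out::println);
    }
}
```

The nested {kl} is consumed as part of the outer block, so it is never reported on its own.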

Related

A regex that will capture all text after a colon but ends with a comma - within can be nested brackets, commas etc

I am trying to redact something that log4j is overriding in general. To do this, I am trying to change a regex to ensure it captures what I need...an example...
"definition":{"schema":{"columns":[{"dataType":"INT","name":"column_a","description":"description1"},{"dataType":"INT","name":"column_b","description":"description2"}]}}}, "some other stuff": ["SOME_STUFF"], etc.
Hoping to capture just...
{"schema":{"columns":[{"dataType":"INT","name":"column_a","description":"*** REDACTED ***"},{"dataType":"INT","name":"column_b","description":"description"}]}}}
I have this...
(?<=("definition":{))(\\.|[^\\])*?(?=}})
Where if I keep adding a } at the end it will keep highlighting what I need. The problem is that there is no set number of nested elements in the list.
Is there any way to adjust the above so I can capture everything within the outer brackets?
If you don't have other brackets after the last one you're trying to match, this regex should work for you:
(?<=\"definition\":)\{.*\}(?:\})
The main difference is moving the brackets from the lookarounds to the matching part.
Check the demo here.
This regex should work for you if you cannot use a proper JSON parser:
(?<=\"definition\":).+?\}(?=,\h*\")
RegEx Demo
Breakdown:
(?<=\"definition\":): Lookbehind condition to make sure we have "definition": before the current position
.+?\}: Match 1+ of any characters ending with }
(?=,\h*\"): Lookahead to assert that we have a comma then 0 or more spaces followed by a " ahead of the current position
In Java use this regex declaration:
String regex = "(?<=\"definition\":).+?\\}(?=,\\h*\")";
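A small sketch of that declaration in use. The class name, method name and the shortened sample line are assumptions for illustration; the approach only works while the value is followed by a comma and another quoted key on the same line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DefinitionExtractor {
    // Lazily match from just after "definition": up to a } that is followed by , and a quote.
    private static final Pattern DEFINITION =
            Pattern.compile("(?<=\"definition\":).+?\\}(?=,\\h*\")");

    /** Returns the raw value of "definition", if the surrounding text has the expected shape. */
    static Optional<String> extract(String line) {
        Matcher m = DEFINITION.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened, hypothetical log line with the same structure as the question's example.
        String line = "\"definition\":{\"schema\":{\"columns\":[{\"name\":\"a\"},{\"name\":\"b\"}]}},"
                + " \"other\": [\"X\"]";
        System.out.println(extract(line).orElse("<no match>"));
    }
}
```

Because the match relies on the text that follows the value rather than on counting brackets, a real JSON parser remains the safer choice whenever one is available.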

How to use java regex to get text between brackets

So I know this question may appear similar to other questions out there regarding regex and such. I believe mine is unique because I'm using java to parse some javascript, which can contain brackets within brackets for anonymous functions etc. Consider the following as an example:
describe('a jasmine describe', function (){
it('login', function(){
//some function stuff
});
it('another it statement', function() {
//some additional stuff
});
});
What I ultimately want is:
Group 1: "a jasmine describe"
Group 2: all of the content between open/close brackets of the describe
I believe I have the regex to get the Group 1 I'm looking for which is:
Pattern r = Pattern.compile("(?:describe\\s*\\(\\s*')(.*?)(?=')", Pattern.CASE_INSENSITIVE);
But I have no idea how to get the contents between the open/close of the specific describe bracket.
Regex may not be the best tool for this, but you can try this regex:
^(?m)(?<indent>\s*)describe\('([^']+)'[^{]+\{([\s\S]+?)\n\k<indent>\}\);
DEMO
^(?m) - beginning of a line in multiline mode (the flag could be replaced with Pattern.MULTILINE),
(?<indent>\s*) - capture the indentation before the method,
describe\( - describe followed by an opening parenthesis,
'([^']+)' - match the text between single quotes; needs to be modified if the text can contain ',
[^{]+\{ - match text up to the first {,
([\s\S]+?) - match anything, with a reluctant quantifier,
\n\k<indent>\}\); - a new line, followed by the captured indentation, followed by the closing of the method body.
This captures 'a jasmine describe' in the 2nd group and the describe content in the 3rd group, because the named group indent takes the 1st slot. The indent group captures the indentation before the describe call and reuses it as a boundary: matching stops at a }); that is preceded by exactly that indentation. This is a workaround for matching nested brackets, so the code needs to be well formatted.
Of course, in Java code you need to double the \ backslashes.
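A hedged sketch of the indentation-based pattern in Java. It assumes the inline (?m) is swapped for Pattern.MULTILINE, and that the inner it(...) blocks are indented deeper than the describe call. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DescribeBlock {
    // Group 1 = indent, group 2 = describe title, group 3 = body between the braces.
    private static final Pattern DESCRIBE = Pattern.compile(
            "^(?<indent>\\s*)describe\\('([^']+)'[^{]+\\{([\\s\\S]+?)\\n\\k<indent>\\}\\);",
            Pattern.MULTILINE);

    /** Returns {title, body} for the first describe(...) call, or null if none is found. */
    static String[] parse(String source) {
        Matcher m = DESCRIBE.matcher(source);
        return m.find() ? new String[] {m.group(2), m.group(3)} : null;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
                "describe('a jasmine describe', function (){",
                "  it('login', function(){",
                "    //some function stuff",
                "  });",
                "  it('another it statement', function() {",
                "    //some additional stuff",
                "  });",
                "});");
        String[] result = parse(source);
        System.out.println("title: " + result[0]);
        System.out.println("body:" + result[1]);
    }
}
```

If the inner }); lines had the same indentation as the describe call, the reluctant quantifier would stop at the first of them, which is why well-formatted input matters here.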
This regex matches your target capturing groups 1 and 2 as required:
describe\('([^']*).*?function\s*\(\)\s*\{(([^{]*\{[^}]*\})*[^}]*)\}
This will handle any number of non-nested curly-bracketed input in the body of the function.
See live demo.

Any suggestions to match and extract the pattern?

I want to match something like this
$(string).not(string).not(string)
The not(string) can repeat zero or more times, after $(string).
Note that the string can be whatever things, except nested not(string).
I used the regular expression (\\$\\((.*)\\))((\\.not\\((.*?)\\))*?)(?!(\\.not)), I think the *? is to non-greedily match any number of sequence of not(string), and use the lookahead to stop the match that is not not(string), so that I can extract only the part that I want.
However, when I tested on the input like
$(string).not(string).not(string).append(string)
group(0) returns the whole string, whereas I only need $(string).not(string).not(string).
Obviously I still miss something or misuse of anything, any suggestions?
Try this one (escaped for java):
(\\$\\(string\\)(?:(?:\\.not\\(.*?\\))+))
It should capture just the part that you are after. You can test it out (unescaped for java though)
If we assume that parenthesis are not nested, you can write something like this:
String p = "\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*";
There is no need to add a lookahead, since the non-capturing group has a greedy quantifier (so the group is repeated as many times as possible).
If what you called string in your question may be a quoted string with parentheses inside, as in Pshemo's example $(string).not(".not(foo)").not(string), you can replace each [^)]* with (?:\\s*\"[^\"]*\"\\s*|[^)]*) to ignore characters inside quoted parts.
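For illustration, here is a short sketch of the non-nested pattern applied to the input from the question. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotChain {
    // $(...) followed by zero or more .not(...) calls, assuming no nested parentheses.
    private static final Pattern CHAIN =
            Pattern.compile("\\$\\([^)]*\\)(?:\\.not\\([^)]*\\))*");

    /** Returns the first $(...).not(...)... chain in the input, or null if there is none. */
    static String firstChain(String input) {
        Matcher m = CHAIN.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Prints $(string).not(string).not(string); the trailing .append(string) is left out.
        System.out.println(firstChain("$(string).not(string).not(string).append(string)"));
    }
}
```

The greedy (?:...)* simply stops at the first call that is not .not(...), so no lookahead is involved.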
From here, "group zero denotes the entire pattern". Use group(1).
(\$\([\w ]+\))(\.not\([\w ]+\))*
This will also work, it would give you two groups, One consisting of the word with $ sign, another would give you the set of all ".not" strings.
Please note: You might have to add escape characters for java.

Match text with possible brackets between brackets

I need to match text between ${ and }
Example:
${I need to match this text}
The simple regex \$\{(.+?)\} works fine until I place a } inside the text.
Curly brackets inside the text to match are always paired.
Is there any possibility to solve this by means of Regular Expressions?
\$\{((?:\{[^\{\}]*\}|[^\{\}]*)*)\}
If we meet an opening bracket, we look for its pair, and after the closing one we proceed as usual. This can't handle more than one level of nested brackets.
The main building block here is [^\{\}]*, any sequence of non-bracket characters. It may be surrounded by brackets, \{[^\{\}]*\}, or not, hence the alternation (?:\{[^\{\}]*\}|[^\{\}]*). Any number of these sequences can be present, hence the * at the end.
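A minimal sketch of this pattern in Java, with the backslashes doubled for a string literal. The class name, method name and sample input are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OneLevelPlaceholder {
    // ${ ... } whose content may contain plain text and one level of paired { }.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{((?:\\{[^\\{\\}]*\\}|[^\\{\\}]*)*)\\}");

    /** Returns the text inside the first ${...}, or null if nothing matches. */
    static String content(String input) {
        Matcher m = PLACEHOLDER.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints: outer {inner} text
        System.out.println(content("${outer {inner} text} tail"));
    }
}
```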
Arbitrary levels of nesting would require a recursive regex, which Java does not support. But any fixed depth can be matched by carefully extending this idea.
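When the nesting depth is not known in advance, one alternative is to drop the regex and count brackets by hand. The sketch below is a plain depth counter, not a regex technique, and its class and method names are invented for illustration. It assumes the brackets inside the placeholder are paired, as the question states.

```java
final class BalancedPlaceholder {
    /** Returns the text inside the first balanced ${...}, or null if there is none. */
    static String content(String input) {
        int start = input.indexOf("${");
        if (start < 0) {
            return null;
        }
        int depth = 0;
        // Begin at the { that follows the $, so depth becomes 1 immediately.
        for (int i = start + 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return input.substring(start + 2, i);
            }
        }
        return null; // the opening ${ was never closed
    }

    public static void main(String[] args) {
        // Prints: a{b{c}}d
        System.out.println(content("${a{b{c}}d} x"));
    }
}
```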
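Add a $ to the end of the regex and don't escape it. An unescaped dollar sign anchors the match to the end of the input, so the lazy group has to extend to the final }. This only helps when the ${...} expression is the last thing in the input.
Regex: \$\{(.+?)\}$
Java formatted: \\$\\{(.+?)\\}$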

Java - Unknown characters passing as [a-zA-z0-9]*?

I'm no expert in regex but I need to parse some input I have no control over, and make sure I filter away any strings that don't have A-z and/or 0-9.
When I run this,
Pattern p = Pattern.compile("^[a-zA-Z0-9]*$"); //fixed typo
if(!p.matcher(gottenData).matches())
System.out.println(someData); //someData contains gottenData
certain spaces + an unknown symbol somehow slip through the filter (gottenData is the red rectangle):
In case you're wondering, it DOES also display Text, it's not all like that.
For now, I don't mind the [?] as long as it also contains some string along with it.
Please help.
[EDIT] as far as I can tell from the (very large) input, the [?]'s are either white spaces either nothing at all; maybe there's some sort of encoding issue, also perhaps something to do with #text nodes (input is xml)
The * quantifier matches "zero or more", which means it will match a string that does not contain any of the characters in your class. Try the + quantifier, which means "One or more": ^[a-zA-Z0-9]+$ will match strings made up of alphanumeric characters only. ^.*[a-zA-Z0-9]+.*$ will match any string containing one or more alphanumeric characters, although the leading .* will make it much slower. If you use Matcher.lookingAt() instead of Matcher.matches, it will not require a full string match and you can use the regex [a-zA-Z0-9]+.
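To make the difference between these Matcher methods concrete, here is a small sketch (class and method names are made up) that runs the same [a-zA-Z0-9]+ pattern through matches(), lookingAt() and find():

```java
import java.util.regex.Pattern;

final class MatchModes {
    private static final Pattern ALNUM = Pattern.compile("[a-zA-Z0-9]+");

    // matches(): the whole string must match the pattern.
    static boolean wholeString(String s) {
        return ALNUM.matcher(s).matches();
    }

    // lookingAt(): the pattern must match at the start, but may stop early.
    static boolean prefix(String s) {
        return ALNUM.matcher(s).lookingAt();
    }

    // find(): the pattern may match anywhere in the string.
    static boolean anywhere(String s) {
        return ALNUM.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abc ", " abc", " "}) {
            System.out.printf("'%s': matches=%b lookingAt=%b find=%b%n",
                    s, wholeString(s), prefix(s), anywhere(s));
        }
    }
}
```

Note that lookingAt() still anchors at the beginning of the input, so a string with leading whitespace only passes with find().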
You have an error in your regex: instead of [a-zA-z0-9]* it should be [a-zA-Z0-9]*.
You don't need ^ and $ around the regex.
Matcher.matches() always matches the complete string.
String gottenData = "a ";
Pattern p = Pattern.compile("[a-zA-z0-9]*");
if (!p.matcher(gottenData).matches())
System.out.println("doesn't match.");
this prints "doesn't match."
The correct answer is a combination of the above answers. First, I imagine your intended character class is [a-zA-Z0-9]. Note that A-z isn't as bad as you might think: it includes all characters in the ASCII range between A and z, which is the letters plus a few extras (specifically [, \, ], ^, _ and `).
A second potential problem, as Martin mentioned, is that you may need to put in the start and end anchors if you want the string to consist only of letters and numbers.
Finally, you use the * quantifier, which means zero or more, so the pattern can match zero characters: an empty string passes matches(), and an unanchored pattern used with find() would succeed on any input. What you need is the + quantifier. So the pattern you are most likely looking for is:
^[a-zA-Z0-9]+$
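A short sketch of that pattern as a filter, together with the A-z typo for comparison. The class, method and constant names are hypothetical; gottenData from the question would be passed in as the argument.

```java
import java.util.regex.Pattern;

final class AlnumFilter {
    // One or more ASCII letters or digits, nothing else.
    private static final Pattern ALNUM = Pattern.compile("^[a-zA-Z0-9]+$");

    // The typo from the question: A-z also covers [ \ ] ^ _ and the backtick.
    private static final Pattern TYPO = Pattern.compile("^[a-zA-z0-9]+$");

    static boolean isAlphanumeric(String s) {
        return ALNUM.matcher(s).matches();
    }

    static boolean passesTypoPattern(String s) {
        return TYPO.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumeric("riz99"));    // true
        System.out.println(isAlphanumeric(""));         // false, because + needs at least one character
        System.out.println(isAlphanumeric("a "));       // false, because of the space
        System.out.println(isAlphanumeric("a_b"));      // false
        System.out.println(passesTypoPattern("a_b"));   // true, the underscore slips through A-z
    }
}
```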
You have to change the regexp to "^[a-zA-Z0-9]*$" to ensure that you are matching the entire string
Looks like it should be "a-zA-Z0-9", not "a-zA-z0-9", try correcting that...
Did anyone consider adding a space to the character class, as in [a-zA-Z0-9 ]*? This should match any normal text made of letters, digits and spaces. If you want quotes and other special characters, add them to the class too.
You can quickly test your regex at http://www.regexplanet.com/simple/
You can check whether an input value contains only letters and digits with the regex ^[a-zA-Z0-9]*$
If the value contains only letters and digits, it matches, e.g. riz99, riz99z.
Otherwise it does not match, e.g. 99z., riz99.z, riz99.9.
Example code:
if (e.target.value.match('^[a-zA-Z0-9]*$')) {
    console.log('match')
} else {
    console.log('not match')
}
online working example
