Note: This is a Java-only question (i.e. no Javascript, sed, Perl, etc.)
I need to filter out all the "reluctant" curly braces ({}) in a long string of text.
(by "reluctant" I mean as in reluctant quantifier).
I have been able to come up with the following regex which correctly finds and lists all such occurrences:
Pattern pattern = Pattern.compile("(\\{)(.*?)(\\})", Pattern.DOTALL);
Matcher matcher = pattern.matcher(originalString);
while (matcher.find()) {
Log.d("WITHIN_BRACES", matcher.group(2));
}
My problem now is how to replace every found matcher.group(0) with the corresponding matcher.group(2).
Intuitively I tried:
while (matcher.find()) {
String noBraces = matcher.replaceAll(matcher.group(2));
}
But that replaced all found matcher.group(0) with only the first matcher.group(2), which is of course not what I want.
Is there an expression or a method in Java's regex to perform this "corresponding replaceAll" that I need?
ANSWER: Thanks to the tip below, I have been able to come up with 2 fixes that did the trick:
if (matcher.find()) {
String noBraces = matcher.replaceAll("$2");
}
Fix #1: Use "$2" instead of matcher.group(2)
Fix #2: Use if instead of while.
Works now like a charm.
You can use the special backreference syntax:
String noBraces = matcher.replaceAll("$2");
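A minimal, self-contained sketch of the backreference replacement. The class name BraceStripper, the method stripBraces, and the sample text are made up for illustration; only Pattern, Matcher, and replaceAll come from java.util.regex. Since replaceAll resets the matcher before scanning, no preceding find() call should be needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BraceStripper {
    // Same pattern as in the question: group 2 is the text between a '{' and the nearest '}'.
    private static final Pattern BRACES = Pattern.compile("(\\{)(.*?)(\\})", Pattern.DOTALL);

    // Hypothetical helper: replaces every "{...}" with its own inner text.
    static String stripBraces(String text) {
        Matcher matcher = BRACES.matcher(text);
        // "$2" is expanded separately for each match to that match's group 2.
        return matcher.replaceAll("$2");
    }

    public static void main(String[] args) {
        System.out.println(stripBraces("a {b} c {d e} f")); // prints "a b c d e f"
    }
}
```

If the replacement text ever needs a literal dollar sign or backslash, Matcher.quoteReplacement can be used to escape it.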
I want to match the pattern (including the square brackets, equals, quotes)
[fixedtext="sometext"]
What would be a correct regex expression?
Anything can occur inside quotes. 'fixedtext' is fixed.
Your basic solution (although I'd be skeptical of this, per the comments) is essentially:
"\\[fixedtext=\\\"(.*)\\\"\\]"
which resolves to:
"\[fixedtext=\"(.*)\"\]"
Simple escaping of [] and quotes. The (.*) says capture everything in quotes as a capture group (matcher.group(1)).
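But note that (.*) is greedy, so it runs up to the last "] in the input. For a string such as '[fixedtext="abc\"]def"]' it captures abc\"]def, whereas a reluctant (.*?) would stop at the first "] and give abc\ instead. With several such tags on one line, the greedy version would swallow everything between the first and the last tag.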
If you know the ending bracket ends the line, then use:
"\\[fixedtext=\\\"(.*)\\\"\\]$"
(add the $ at the end to mark end of line) and that should be fairly reliable.
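To see how the choice of quantifier changes the capture, here is a small sketch that runs a greedy and a reluctant variant of the pattern against two sample inputs. The class FixedTextDemo, the helper firstGroup, and both inputs are assumptions made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FixedTextDemo {
    static final String GREEDY = "\\[fixedtext=\"(.*)\"\\]";
    static final String RELUCTANT = "\\[fixedtext=\"(.*?)\"\\]";

    // Hypothetical helper: group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Input containing an escaped quote followed by a bracket: [fixedtext="abc\"]def"]
        String escaped = "[fixedtext=\"abc\\\"]def\"]";
        System.out.println(firstGroup(GREEDY, escaped));    // abc\"]def
        System.out.println(firstGroup(RELUCTANT, escaped)); // abc\

        // Two tags on one line: the greedy capture spans both of them.
        String two = "[fixedtext=\"abc\"] [fixedtext=\"def\"]";
        System.out.println(firstGroup(GREEDY, two));
        System.out.println(firstGroup(RELUCTANT, two));     // abc
    }
}
```

Neither variant is right for every input, which is why anchoring on the end of the line helps only when you know the tag really ends the line.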
My suggestion is using named-capturing groups.
You can find more details here:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Here's an example for your input:
String input = "[fixedtext=\"sometext\"]";
Pattern pattern = Pattern.compile("\\[(?<field>.*)=\"(?<value>.*)\"]");
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
System.out.println(matcher.group("field"));
System.out.println(matcher.group("value"));
} else {
System.err.println(input + " doesn't match " + pattern);
}
I'm using Java and I would like to write code that outputs PRP I when the input is (NP (PRP I)).
My current implementation is like the following:
Pattern pattern = Pattern.compile("\\((.*?)\\)");
Matcher matcher = pattern.matcher(noun_phrase);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
and its output is NP (PRP I.
I know that one possibility would be to count the parentheses, but I'm wondering if there is any way to get just the string inside the nested parentheses using regex.
This should work
Pattern pattern = Pattern.compile("\\(.*?\\((.*?)\\)\\)");
Matcher matcher = pattern.matcher("(NP (PRP I))");
while (matcher.find()) {
System.out.println(matcher.group(1));
}
You can use the following sites to experiment with regular expressions.
https://regex101.com/r/cE0dM7/1
http://leaverou.github.io/regexplained/
https://www.debuggex.com/r/gfVglXkY1Cw5D6Mb
You need to add another pair of parentheses around the group. You also need to make sure that the text between the literal parentheses does not itself match parentheses:
String noun_phrase = "(NP (PRP I))";
Pattern pattern = Pattern.compile("\\([^(]*\\(([^)]*)\\)[^)]*\\)");
Matcher matcher = pattern.matcher(noun_phrase);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
The negated character classes [^(] and [^)] make sure you don't match parentheses too eagerly.
Well, as I don't know how deep your parentheses can nest, I will suggest two possible solutions.
Solution 1: Assuming the depth is exactly as in your question.
This regex will work: Pattern pattern = Pattern.compile("\\(([^()]*)\\)").
Solution 2: Assuming the depth is arbitrary (but at least the innermost string is surrounded by parentheses).
In this case, you will have to make some more changes. First, your pattern will look like this: Pattern pattern = Pattern.compile("(\\(.*)*\\(([^)]*)\\)"). See the difference? You now have two groups: the first matches everything before the innermost parenthesized part, and the second is exactly the one you want. That means that in your loop you have to change matcher.group(1) to matcher.group(2). Furthermore, [^)] makes sure the captured group contains no closing parentheses.
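Both patterns can be tried side by side with a sketch like the following. The class InnermostParens, its two helper methods, and the longer sample sentence are hypothetical and only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InnermostParens {
    // Solution 1: a pair of parentheses that contains no other parentheses.
    private static final Pattern INNERMOST = Pattern.compile("\\(([^()]*)\\)");
    // Solution 2: group 1 consumes the outer levels, group 2 is the innermost content.
    private static final Pattern NESTED = Pattern.compile("(\\(.*)*\\(([^)]*)\\)");

    // Collects every innermost parenthesized string, left to right.
    static List<String> innermost(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = INNERMOST.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Returns group 2 of the first match of the Solution 2 pattern, or null.
    static String viaSecondGroup(String text) {
        Matcher m = NESTED.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(innermost("(NP (PRP I))"));                    // [PRP I]
        System.out.println(innermost("(S (NP (PRP I)) (VP (VBP run)))")); // [PRP I, VBP run]
        System.out.println(viaSecondGroup("(NP (PRP I))"));               // PRP I
    }
}
```

The first pattern does not depend on the nesting depth at all, and it avoids the heavy backtracking that the nested greedy .* in the second pattern can cause on long inputs.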
I currently have this string:
"display_name":"test","game":"test123"
and I want to split the string so I can get the value test. I have looked all over the internet and tried some things, but I couldn't get it to work.
I found that splitting on quotation marks could be done using this regex: \"([^\"]*)\". So I tried this regex: display_name:\":\"([^\"]*)\"game\", but this returned null. I hope someone can explain to me why my regex didn't work and how it should be done.
You forgot to include the comma and the opening quote before game, and you also need to remove the extra colon after display_name:
display_name\":\"([^\"]*)\",\"game\"
or
\"display_name\":\"([^\"]*)\",\"game\"
Now, print the group index 1.
DEMO
Matcher m = Pattern.compile("\"display_name\":\"([^\"]*)\",\"game\"").matcher(str);
while(m.find())
{
System.out.println(m.group(1));
}
I think you could do it easier, like this:
/(\w)+/g
This little regex will take all your strings.
Your java code should be something like:
Pattern pattern = Pattern.compile("(\\w)+");
Matcher matcher = pattern.matcher(yourText);
while (matcher.find()) {
// group() is the whole match; the pattern has only one capturing group, so group(2) would throw
System.out.println("Result: " + matcher.group());
}
I also want to point out, as @AbishekManoharan noted, that this looks like JSON.
In the following code, I need to extract a value:
String regex = "Item#: <em>.*</em>";
String content = "xxx Item#: <em>something</em> yyy";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
if( matcher.find() ) {
System.out.println(matcher.group());
}
It will print:
Item#: <em>something</em>
But I just need the value "something".
I know I can use .substring(begin,end) to get the value,
but is there another way that would be more elegant?
It prints the whole match because that is what you asked for: matcher.group() returns the complete match. To get a specific part of the matched string, you need to change your regex to capture the content between the tags in a group:
String regex = "Item#: <em>(.*?)</em>";
Also, use Reluctant quantifier(.*?) to match the least number of characters before an </em> is encountered.
And then in if, print group(1) instead of group()
if( matcher.find() ) {
System.out.println(matcher.group(1));
}
Anyways, you should not use Regex to parse HTML. Regex is not strong enough to achieve this task. You should probably use some HTML parser like - HTML Cleaner. Also see the link that is provided in one of the comments in the OP. That post is very nice explanation of the problems you can face.
I have strings like "xxxxx?434334", "xxx?411112", "xxxxxxxxx?11113" and so on.
How do I take a substring properly to retrieve "xxxxx" (everything that comes before the '?' character)?
return s.substring(0, s.indexOf('?'));
No need for a regex for that.
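One caveat with the one-liner: indexOf returns -1 when the string contains no '?', and substring(0, -1) then throws a StringIndexOutOfBoundsException. A guarded version might look like the sketch below; the class and method names are invented for this example, and returning the whole string when no '?' is present is an assumption about the desired behavior.

```java
final class BeforeQuestionMark {
    // Returns the text before the first '?', or the whole string when there is no '?'.
    static String beforeQuestionMark(String s) {
        int idx = s.indexOf('?');
        return idx < 0 ? s : s.substring(0, idx);
    }

    public static void main(String[] args) {
        System.out.println(beforeQuestionMark("xxxxx?434334")); // xxxxx
        System.out.println(beforeQuestionMark("no-question"));  // no-question
    }
}
```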
If you have a problem, use a regex. Now you have two problems.
str = str.replaceAll("[?].*", "");
In other words, "remove everything after, and including, the question mark character". The ? has to be enclosed in square brackets because otherwise it has a special meaning.
I agree with the other answers that you should avoid using a regex where it isn't needed, but if you did want to use one for this scenario, you could use the following:
Pattern regex = Pattern.compile("([^\\?]*)\\?{1}");
Matcher m = regex.matcher(str);
if (m.find()) {
result = m.group(1);
}
where str is your input string.
EDIT:
Description of the regex: match any run of characters that are not a "?", followed by a single "?".
The pattern .*(?=\?) (written ".*(?=\\?)" as a Java string literal) should work as well. (?=...) is a positive lookahead, which means the pattern matches everything that comes before a question mark, but not the question mark itself.
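A short sketch of the lookahead approach follows. The class LookaheadDemo and the helper firstMatch are hypothetical. Because .* is greedy, the lookahead version stops before the last question mark in the input; the [^?]* variant shown next to it is an alternative (not from the answer above) that stops before the first one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Hypothetical helper: the first full match of the regex, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The lookahead checks for '?' without consuming it.
        System.out.println(firstMatch(".*(?=\\?)", "xxxxx?434334")); // xxxxx
        // With two question marks, the greedy .* runs up to the last one.
        System.out.println(firstMatch(".*(?=\\?)", "a?b?c"));        // a?b
        // A negated character class stops before the first one instead.
        System.out.println(firstMatch("[^?]*(?=\\?)", "a?b?c"));     // a
    }
}
```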