Replace string inside tags? - java

I want to replace the content inside some tags, e.g.:
<p>this it to be replaced</p>
I can extract the content between the tags with a group like this, but can I actually replace just the group?
str = str.replaceAll("<p>([^<]*)</p>", "replacement");

You can use lookaround (positive lookahead and lookbehind) for this:
Change the regex to: "(?<=<p>)(.*?)(?=</p>)" and you will be fine.
Example:
String str = "<p>this it to be replaced</p>";
System.out.println(str.replaceAll("(?<=<p>)(.*?)(?=</p>)", "replacement"));
Output:
<p>replacement</p>
Note, however, that if you are parsing HTML you should use an HTML parser; regular expressions are often not good enough for that.

Change the regex to this:
(?<=<p>).*?(?=</p>)
i.e.:
str = str.replaceAll("(?<=<p>).*?(?=</p>)", "replacement");
This uses a "look behind" and a "look ahead" to assert, but not capture, the input before and after the matching (non-greedy) regex.
Just in case anyone is wondering, this answer differs from dacwe's: his wraps the middle part in an unnecessary capturing group. This answer is the more elegant :)
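To answer the original "can I replace just the group?" question more directly, here is a minimal sketch that puts the lookaround approach next to a capturing-group alternative, where the tags are matched but written back via $1 and $2. The class and method names are hypothetical, and Matcher.quoteReplacement is added on the assumption that the replacement text may contain $ or \ characters.

```java
import java.util.regex.Matcher;

final class TagContentReplacer {

    // Lookaround version: only the text between <p> and </p> is part of the
    // match, so the tags themselves are never replaced.
    static String replaceWithLookaround(String html, String replacement) {
        return html.replaceAll("(?<=<p>).*?(?=</p>)",
                Matcher.quoteReplacement(replacement));
    }

    // Capturing-group version: the tags are matched too, but they are put
    // back through the $1 and $2 references in the replacement string.
    static String replaceWithGroups(String html, String replacement) {
        return html.replaceAll("(<p>)[^<]*(</p>)",
                "$1" + Matcher.quoteReplacement(replacement) + "$2");
    }

    public static void main(String[] args) {
        String str = "<p>this it to be replaced</p>";
        System.out.println(replaceWithLookaround(str, "replacement"));
        System.out.println(replaceWithGroups(str, "replacement"));
    }
}
```

Both calls print <p>replacement</p> for this input. Neither variant copes with nested or attribute-carrying tags, which is where an HTML parser becomes the safer choice.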

Related

Java replaceAll with regex only replacing first instance

I am trying to update all URLs in a CSS string, but my regex only seems to get the first one. I want to match anything like:
url("file")
url('file');
url(file);
I also want to exclude things where the url is data:
url("data: ...");
url('data: ...');
url(data: ...);
I wrote some code to do this, but it only replaces the first one:
String str = ".ff0{font-family:sans-serif;visibility:hidden;}#font-face{font-family:ff1;src:url(f1.woff)format(\"woff\");}.ff1{font-family:ff1;line-height:1.330566;font-style:normal;font-weight:normal;visibility:visible;}#font-face{font-family:ff2;src:url(f2.woff)format(\"woff\");}.ff2{font-family:ff2;line-height:1.313477;font-style:normal;font-weight:normal;visibility:visible;}#font-face{font-family:ff3;src:url(f3.woff)format(\"woff\");}.ff3{font-family:ff3;line-height:1.386719;font-style:normal;font-weight:normal;visibility:visible;}#font-face{font-family:ff4;src:url()format(\"woff\");";
str = str.replaceAll("url\\((['\"]?)(?!data)(.*)\\1\\)","url(someURL/$2)");
out.println(str);
Any ideas on how to fix? I imagine it has something to do with the regex.
You probably want to use a non-greedy quantifier (*? instead of *).
To exclude the data entries properly, also use a possessive quantifier when capturing the quotes: ?+ instead of ?.
So your regex should look as follows:
url\((['"]?+)(?!data)(.*?)\1\)
Note that in a Java string literal you need to escape the backslashes and the double quote with an extra backslash, as you did in your example.
Your .* is greedy. It's capturing to the end of the string. Use .*?, instead, which will force the engine to capture as few characters as possible.
str = str.replaceAll("url\\((['\"]?)(?!data)(.*?)\\1\\)","url(someURL/$2)");
Something like this should work:
~\((?!.*data).+\)~
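The combined effect of the lazy and possessive quantifiers is easier to see on a short input. The following is only a sketch: the class name, method name and sample CSS are made up for illustration, and the replacement mirrors the question's "url(someURL/$2)", which means surrounding quotes are dropped from the rewritten URLs.

```java
import java.util.regex.Matcher;

final class CssUrlRewriter {

    // (.*?) is lazy, so each match ends at the first closing parenthesis.
    // (['"]?+) is possessive, so a quote that has been consumed is never
    // given back, and the (?!data) check cannot be bypassed by backtracking.
    private static final String URL_PATTERN =
            "url\\((['\"]?+)(?!data)(.*?)\\1\\)";

    static String prefixUrls(String css, String prefix) {
        return css.replaceAll(URL_PATTERN,
                "url(" + Matcher.quoteReplacement(prefix) + "$2)");
    }

    public static void main(String[] args) {
        String css = "a{src:url(f1.woff)}b{src:url(\"f2.woff\")}"
                + "c{src:url('data:abc')}d{src:url(data:xyz)}";
        System.out.println(prefixUrls(css, "someURL/"));
    }
}
```

With the plain optional quote (?) instead of the possessive one (?+), the engine could backtrack to an empty group 1 and then match url('data:abc') with the quotes inside group 2, so the data URL would be rewritten as well.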

Java: Regex to extract all non numerals AND leading +, if any

Here's a string method I'm using in Java to remove all non-numerals from a given string:
replaceAll("[^\\d.]", "")
Here's an example of what this would return:
Original string: $%^&*+89896 89#6
New string: 89896896
However, now I need to also retain the leading '+' sign if one exists such as in the case of the above illustration (thus, the new string should be +89896896). If this were PHP, I could have simply used a preg function with (^\+)|([\d]+) to get precisely the results I want. I am not sure how to implement it in Java (Android) though.
I came up with this
replaceAll("([^\+])([\D]+)", "")
But the results seem to be slightly distorted. Here's one test result:
Original string: +u +00786uy769+&jh6ghj765765
New string: +007876765765
Desired result: +007867696765765
What am I doing wrong with my expression?
P.S. I would like to avoid using the Pattern and Matcher classes unless that were the only way out.
Use a regex with a negative lookbehind in String.replaceAll:
string.replaceAll("(?<!^)\\+|[^\\d+]", "");
If you don't want to remove the dot, add it to the character class:
string.replaceAll("(?<!^)\\+|[^\\d+.]", "");
(?<!^)\\+ matches every plus sign except one at the very start of the string, so only a plus in that position is kept.
You can use:
str = str.replaceAll("\\+(?!\\d)|[^\\d.+]", "");
\\+(?!\\d) matches a + only when it is not followed by a digit, so a + that directly precedes a digit is kept.
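The two answers behave differently on the question's sample inputs, which a small test makes visible. The sketch below wraps the first answer's pattern in a method and adds a two-pass variant; the two-pass variant is an assumption of this example rather than something proposed in the answers, and it interprets "leading" as "before any digit once everything else is stripped". All names are hypothetical.

```java
final class PhoneDigits {

    // Pattern from the first answer: removes every '+' except one that sits
    // at index 0 of the original string.
    static String keepPlusAtStart(String s) {
        return s.replaceAll("(?<!^)\\+|[^\\d+]", "");
    }

    // Two-pass variant (assumption, not from the answers): first strip
    // everything except digits and '+', then remove any '+' that is not
    // at the start of the stripped string.
    static String keepLeadingPlus(String s) {
        return s.replaceAll("[^\\d+]", "").replaceAll("(?<!^)\\+", "");
    }

    public static void main(String[] args) {
        String a = "$%^&*+89896 89#6";
        String b = "+u +00786uy769+&jh6ghj765765";
        System.out.println(keepPlusAtStart(a));  // plus is lost: not at index 0
        System.out.println(keepLeadingPlus(a));
        System.out.println(keepLeadingPlus(b));
    }
}
```

The single-pass pattern drops the plus in the first sample because junk characters precede it in the original string; stripping the junk first avoids that, at the cost of a second replaceAll call.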

Strip out every occurrence using regex

I want to strip out every occurrence of (title) from a string like below. How do I write a regex for that? I tried a regex like below but it doesn't work.
String ruler1="115.28(54)(title) is renumbered 115.363(title) and amended to read:";
Pattern rulerPattern1 = Pattern.compile("(.*)\\(title\\)(.*)", Pattern.MULTILINE);
System.out.println(rulerPattern1.matcher(ruler1).replaceAll(""));
The regex is much simpler than that - all you need is to escape parentheses, like this:
\\(title\\)
You do not need to use the Pattern class explicitly, because replaceAll takes a regular expression.
String ruler1="115.28(54)(title) is renumbered 115.363(title) and amended to read:";
String result = ruler1.replaceAll("\\(title\\)", "");
Your pattern matches the entire string whenever it contains "(title)", because of the (.*) on both sides, so replaceAll replaces everything with the empty string.
Here is a demo on ideone.
Just use what String has to offer:
System.out.println(ruler1.replace("(title)", ""));
DO NOT be fooled by its name vs .replaceAll(), it is very misleading:
.replace() does NOT use regexes;
.replace() DOES replace all occurrences.
Given what you need to do, it is a perfect fit. Javadoc for .replace()
I don't think regex is a great solution for something so simple. Try StringUtils.replace() from the Apache commons-lang package.
String result = StringUtils.replace(ruler1,"(title)","");
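The difference between the literal and the regex-based methods can be checked side by side. This is a small sketch using only java.lang.String and java.util.regex.Pattern; the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class TitleStripper {

    // String.replace treats the target as literal text and still replaces
    // every occurrence, so no escaping is needed.
    static String stripLiteral(String s) {
        return s.replace("(title)", "");
    }

    // String.replaceAll interprets its first argument as a regex;
    // Pattern.quote escapes the parentheses so they match literally.
    static String stripRegex(String s) {
        return s.replaceAll(Pattern.quote("(title)"), "");
    }

    public static void main(String[] args) {
        String ruler1 = "115.28(54)(title) is renumbered 115.363(title) and amended to read:";
        System.out.println(stripLiteral(ruler1));
        System.out.println(stripRegex(ruler1));
        // Unescaped, "(title)" is a capturing group around the word "title",
        // so the parentheses in the input are left behind.
        System.out.println(ruler1.replaceAll("(title)", ""));
    }
}
```

The first two lines print "115.28(54) is renumbered 115.363 and amended to read:", while the unescaped regex leaves empty "()" pairs in the text.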

Java Regex doesn't work with special chars

I have a problem with my parser. I want to read an image link on a website, and this normally works fine. But today I got a link that contains special characters, and the usual regex did not work.
This is what my code looks like:
Pattern t = Pattern.compile(regex.trim());
Matcher x = t.matcher(content[i].toString());
if(x.find())
{
values[i] = x.group(1);
}
And this is the part of the HTML that causes trouble:
<div class="open-zoomview zoomlink" itemscope="" itemtype="http://schema.org/Product">
<img class="zoomLink productImage" src="
http://tnm.scene7.com/is/image/TNM/template_335x300?$plus_335x300$&$image=is{TNM/1098845000_prod_001}&$ausverkauft=1&$0prozent=1&$versandkostenfrei=0" alt="Produkt Atika HB 60 Benzin-Heckenschere" title="Produkt Atika HB 60 Benzin-Heckenschere" itemprop="image" />
</div>
And this is the regex I am using to get the part in the src-attribute:
<img .*src="(.*?)" .*>
I believe it has something to do with all the special characters inside the link, but I'm not sure how to escape all of them. I already tried
Pattern.quote(content[i].toString())
But the outcome was the same: nothing found.
By default, the . character matches any character except line terminators. Therefore, your pattern won't match if there is a newline inside the img tag, as there is right after src=" in your HTML.
Use Pattern.compile(..., Pattern.DOTALL) or prefix your pattern with (?s).
In dotall mode, the expression . matches any character, including a
line terminator. By default this expression does not match line
terminators.
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html#DOTALL
Your regex should look like this:
String regex = "<img .*src=\"(.*?)\" .*>";
This is probably caused by the newline within the tag. The . character won't match it.
Did you consider not using regex to parse HTML? Parsing HTML with regex is notoriously fragile. Please consider using a parsing library such as JSoup for this.
You should actually use "<img\\s.*?\\bsrc=[\"'](.*?)[\"'].*?>" with the (?s) modifier.
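As a rough illustration of the DOTALL fix, the sketch below extracts the first src value from an img tag whose attribute value starts with a newline, as in the question. The class name, method name and sample markup (including the example.com URL) are hypothetical, and trimming the captured value is an extra step assumed here to drop the leading newline.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcExtractor {

    // Pattern.DOTALL lets '.' match line terminators; prefixing the pattern
    // with (?s) would have the same effect.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img\\s.*?\\bsrc=[\"'](.*?)[\"']", Pattern.DOTALL);

    // Returns the trimmed src value of the first img tag, or null if none is found.
    static String firstSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<img class=\"zoomLink productImage\" src=\"\n"
                + "http://example.com/image?$size$&$flag=1\" alt=\"Produkt\" />";
        System.out.println(firstSrc(html));
    }
}
```

The special characters in the URL ($, &, {, }) need no escaping because they are part of the input, not of the pattern; only the newline was preventing the match.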

java Regex - split but ignore text inside quotes?

Using only regular expression methods, String.replaceAll and ArrayList,
how can I split a String into tokens but ignore delimiters that appear inside quotes?
The delimiter is any character that is neither alphanumeric nor part of quoted text.
for example:
The string :
hello^world'this*has two tokens'
should output:
hello
worldthis*has two tokens
I know there is a good accepted answer already, but I would like to add another (and, may I say, simpler) regex-based approach that splits the given text on any non-alphanumeric delimiter which is not inside single quotes.
Regex:
/(?=(([^']+'){2})*[^']*$)[^a-zA-Z\\d]+/
This matches non-alphanumeric text only if it is followed by an even number of single quotes; in other words, only if it lies outside single quotes.
Code:
String string = "hello^world'this*has two tokens'#2ndToken";
System.out.println(Arrays.toString(
string.split("(?=(([^']+'){2})*[^']*$)[^a-zA-Z\\d]+"))
);
Output:
[hello, world'this*has two tokens', 2ndToken]
Demo:
Here is a live working Demo of the above code.
Use a Matcher to identify the parts you want to keep, rather than the parts you want to split on:
String s = "hello^world'this*has two tokens'";
Pattern pattern = Pattern.compile("([a-zA-Z0-9]+|'[^']*')+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
See it working online: ideone
You cannot in any reasonable way. You are posing a problem that regular expressions aren't good at.
Do not use a regular expression for this. It won't work. Use / write a parser instead.
You should use the right tool for the right task.
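For what it is worth, the Matcher approach above can be extended to produce exactly the output the question asks for (quoted text kept, quote characters dropped) while staying within regex methods, String.replaceAll and ArrayList. This is a sketch that assumes quotes are balanced; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareTokenizer {

    // A token is a run of alphanumeric characters and/or quoted sections.
    private static final Pattern TOKEN =
            Pattern.compile("(?:[a-zA-Z0-9]+|'[^']*')+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Drop the quote characters but keep the text they enclosed.
            tokens.add(m.group().replaceAll("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("hello^world'this*has two tokens'")) {
            System.out.println(token);
        }
    }
}
```

An unmatched single quote is simply treated as a delimiter here; input with escaped or nested quotes would need a real parser, as the last answers point out.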
