ReplaceAll Regex: Update group before Replacing - java

I'm using a regex to capture some values in groups and insert them into another part of a new string, but I need to modify one of the captured groups before replaceAll runs. I have this code:
String regex = "<button data-key=\"([^\"]*)([^<]*)</button>";
while ((strLine = br.readLine()) != null) {
String newStr = strLine.replaceAll(regex, "<button data-key=\"$1$2<span>&#x$1</span></button>");
}
This works when the data-key value (group 1 / $1) is simple (no "-" in it). But when data-key is, e.g., 1f1e8-1f1e6, the span receives &#x1f1e8-1f1e6. So I was wondering whether something like this is possible:
String newStr =
strLine.replaceAll(regex,
"<button data-key=\"$1$2<span>&#x" + "$1".replaceAll("-", "&#x") + "</span></button>");
That is, replace "-" with "&#x" in $1 inside the replaceAll call, but no success so far. Do I need to switch to a Matcher? Any help on the best approach for this scenario would be appreciated. Thanks.
Edit1:
CURRENT:
<button data-key="1f1e8-1f1e8-1f1e8"><span>&#x1f1e8-1f1e8-1f1e8</span></button><button data-key="1f1e8-1f1e9"><span>&#x1f1e8-1f1e9</span></button>
EXPECTED:
<button data-key="1f1e8-1f1e8-1f1e8"><span>&#x1f1e8&#x1f1e8&#x1f1e8</span></button><button data-key="1f1e8-1f1e9"><span>&#x1f1e8&#x1f1e9</span></button>
Edit2:
INPUT:
<button data-key="1f1e8-1f1e8-1f1e8"></button>
<button data-key="1f1e8-1f1e9"></button>
Edit3:
WHOLE INPUT:
<div>
<h3>GG</h3>
<div class="ep-categoryItems">
<button class="ep-item" data-key="1f1e8-1f1e8-1f1e8" title="Grinning face" style="background-image: url('${cdn}/images/emoji/f1e8-1f1e8-1f1e8.png');"></button>
<button class="ep-item" data-key="1f1e8-1f1e9" title="Grinning face" style="background-image: url('${cdn}/images/emoji/1f1e8-1f1e9.png');"></button>
</div>
<div

UPDATE: Changed to work in Java 8 and with new input.
Also fixed to add the missing ;
It can be done like this:
String input = "<button data-key=\"1f1e8-1f1e8-1f1e8\"></button><button data-key=\"1f1e8-1f1e9\"> TO BE REPLACED </button>";
String regex = "(<button data-key=\"([^\"]+)\">).*?</button>";
StringBuffer buf = new StringBuffer();
Matcher m = Pattern.compile(regex).matcher(input);
while (m.find())
m.appendReplacement(buf, m.group(1) + "<span>" + m.group(2).replaceAll("-?([0-9a-fA-F]+)", "&#x$1;") + "</span></button>");
String output = m.appendTail(buf).toString();
System.out.println(input);
System.out.println(output);
Output
<button data-key="1f1e8-1f1e8-1f1e8"></button><button data-key="1f1e8-1f1e9"> TO BE REPLACED </button>
<button data-key="1f1e8-1f1e8-1f1e8"><span>&#x1f1e8;&#x1f1e8;&#x1f1e8;</span></button><button data-key="1f1e8-1f1e9"><span>&#x1f1e8;&#x1f1e9;</span></button>
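On Java 9 or later, the same idea can probably be written a little more compactly with the Matcher.replaceAll overload that takes a function. The following is only a sketch under that assumption; the class name ButtonSpanSketch and the helpers toEntities and addSpans are made up for illustration. Like the answer above, it assumes the button tag carries nothing but the data-key attribute, so the "WHOLE INPUT" variant with extra attributes would need a looser pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ButtonSpanSketch {
    // Same pattern as in the answer above:
    // group 1 is the opening tag, group 2 is the data-key value.
    private static final Pattern BUTTON =
            Pattern.compile("(<button data-key=\"([^\"]+)\">).*?</button>");

    // Hypothetical helper: turns "1f1e8-1f1e9" into "&#x1f1e8;&#x1f1e9;".
    static String toEntities(String dataKey) {
        StringBuilder sb = new StringBuilder();
        for (String codePoint : dataKey.split("-")) {
            sb.append("&#x").append(codePoint).append(';');
        }
        return sb.toString();
    }

    static String addSpans(String html) {
        // Matcher.replaceAll(Function) exists since Java 9.
        // quoteReplacement guards against '$' or '\' in the generated text.
        return BUTTON.matcher(html).replaceAll(m -> Matcher.quoteReplacement(
                m.group(1) + "<span>" + toEntities(m.group(2)) + "</span></button>"));
    }

    public static void main(String[] args) {
        System.out.println(addSpans("<button data-key=\"1f1e8-1f1e9\"></button>"));
    }
}
```

The point of the callback is that the captured group is available as a real String before the replacement is built, which is what the "$1".replaceAll(...) attempt in the question could not achieve.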

Related

Split string in java using regex by combing both look-ahead and look-behind

I want to split a string in java using a regular expression but I want to match it from forward and from behind also for not missing any of the string.
For example:
test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3"
I have the above string and expected output should be :
Expected Output:
test
<img border=\"0\" src=\"test\" />
hi
<img border=\"0\" src=\"test\" />
test3"
Below is what I have tried
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestParse {
private static final String IMG_S_LookBehind = "(?<=\\>)";
private static final String IMG_S_LookAHead = "(?=<img .*?\\>)";
static String test = "test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3";
static Pattern newPattern(String tag) {
return Pattern.compile(String.format("(<%s\\s*([^>]*)>)(.*)(</%s>)", tag, tag));
}
public static void main(String[] args) {
// Pattern re = newPattern("b");
// Matcher m = re.matcher(test);
//
// if (m.matches()) {
// for (int i = 0; i <= m.groupCount(); i++) {
// System.out.printf("[%d]: [%s]\n", i, m.group(i));
// }
// }
String[] split = test.split(IMG_S_LookAHead);
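// print each part on its own line (printing the array itself only shows its reference)
System.out.println(String.join("\n", split));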
}
}
OUTPUT:
test
<img border=\"0\" src=\"test\" />hi
<img border=\"0\" src=\"test\" /> test3"
I tried looking from behind too but somehow it fails to give me the expected output. Any clue on this will be appreciated.
I wouldn't approach this via a regex split, because it is difficult to phrase/detect boundaries between tags and non-tags etc. Instead, I would try to match either tags, or anything which is not a tag. Here is a working sample script:
String input = "test <img border=\"0\" src=\"test\" />hi<img border=\\\"0\\\" src=\\\"test\\\" /> test3";
String pattern = "<[^>]+>|((?!<[^>]+>).)*";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
while (m.find( )) {
System.out.println(m.group(0));
}
This prints:
test
<img border="0" src="test" />
hi
<img border=\"0\" src=\"test\" />
test3
Perhaps one portion of the regex needs to be explained:
((?!<[^>]+>).)*
This will match anything, so long as it does not encounter the start of a tag. The trick is called "tempered dot," because it is really just .* with a check at each step to make sure that a tag is not intersected.
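For reference, here is one way the tempered-dot idea could be wrapped into a reusable method. This is a sketch, not the answer's exact code: it uses + instead of * so the matcher never produces an empty match at the end of the input, and it trims each token, which is an assumption about what the expected output wants. The names TagTokenizerSketch and tokenize are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTokenizerSketch {
    // Either a tag, or a run of one or more characters in which no tag starts.
    private static final Pattern TOKEN =
            Pattern.compile("<[^>]+>|((?!<[^>]+>).)+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Trimming is an assumption: it drops the spaces around plain text.
            String token = m.group().trim();
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "test <img src=\"a\" />hi<img src=\"b\" /> test3";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```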

Regex: how to substitute a string with n occurrences of a substring

As a premise, I have an HTML text, with some <ol> elements. These have a start attribute, but the framework I'm using is not capable to interpret them during a PDF conversion. So, the trick I am trying to apply is to add a number of invisible <li> elements at the beginning.
As an example, suppose this input text:
<ol start="3">
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
I want to produce this result:
<ol>
<li style="visibility:hidden"></li>
<li style="visibility:hidden"></li>
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
So, adding n-1 invisible elements into the ordered list.
But I'm not able to do that from Java in a generalized way.
Supposing the exact case in the example, I could do this (using replace, so - to be honest - without regex):
htmlString = htmlString.replace("<ol start=\"3\">",
"<ol><li style=\"visibility:hidden\"></li><li style=\"visibility:hidden\"></li>");
But, obviously, it just applies to the case with "start=3". I know that I can use groups to extract the "3", but how can I use it as a "variable" to specify the string <li style=\"visibility:hidden\"></li> n-1 number of times?
Thanks for any insight.
You cannot do this using regular expressions, or even if you find some hack to do this, it's going to be a suboptimal solution.
The right way to do this is to use an HTML parsing library (e.g. Jsoup) and then add the <li> tags as children to the <ol>, specifically using the Element#prepend method. (With Jsoup you can also read the start attribute value in order to compute how many elements to add)
Since Java 9, there's a Matcher.replaceAll method taking a callback function as a parameter:
String text = "<ol start=\"3\">\n\t<li>Element 1</li>\n\t<li>Element 2</li>\n\t<li>Element 3</li>\n</ol>";
String result = Pattern
.compile("<ol start=\"(\\d+)\">") // \\d+ so that values such as start="10" also match
.matcher(text)
.replaceAll(m -> "<ol>" + repeat("\n\t<li style=\"visibility:hidden\" />",
Integer.parseInt(m.group(1))-1));
To repeat the string you can take the trick from here, or use a loop.
public static String repeat(String s, int n) {
return new String(new char[n]).replace("\0", s);
}
Afterwards, result is:
<ol>
<li style="visibility:hidden" />
<li style="visibility:hidden" />
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
If you are stuck with an older version of Java, you can still match and replace in two steps.
Matcher m = Pattern.compile("<ol start=\"(\\d+)\">").matcher(text); // \\d+ also accepts multi-digit values
while (m.find()) {
int n = Integer.parseInt(m.group(1));
text = text.replace("<ol start=\"" + n + "\">",
"<ol>" + repeat("\n\t<li style=\"visibility:hidden\" />", n-1));
}
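If you would rather not call text.replace once per match, a single pass with appendReplacement should also work on older Java versions. The sketch below assumes the tag is exactly <ol start="N"> with no other attributes, and it accepts multi-digit values via \d+. OlStartSketch and expandStart are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OlStartSketch {
    private static final Pattern OL_START =
            Pattern.compile("<ol start=\"(\\d+)\">");
    private static final String HIDDEN_LI =
            "\n\t<li style=\"visibility:hidden\" />";

    static String expandStart(String html) {
        Matcher m = OL_START.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int n = Integer.parseInt(m.group(1));
            // Build "<ol>" followed by n-1 hidden list items.
            StringBuilder replacement = new StringBuilder("<ol>");
            for (int i = 1; i < n; i++) {
                replacement.append(HIDDEN_LI);
            }
            // quoteReplacement keeps '$' and '\' from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandStart(
                "<ol start=\"3\">\n\t<li>Element 1</li>\n</ol>"));
    }
}
```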
Update by Andrea ジーティーオー:
I modified the (great) solution above so that it also handles <ol> elements with multiple attributes, where the tag does not end right after start (for example, an <ol> with letters, such as <ol start="4" style="list-style-type: upper-alpha;">). This uses replaceAll so that the whole tag is handled as a regex.
//Take something that starts with "<ol start=", ends with ">", and has a number in between
Matcher m = Pattern.compile("<ol start=\"(\\d+)\"(.*?)>").matcher(htmlString); // \\d+ also accepts multi-digit values
while (m.find()) {
int n = Integer.parseInt(m.group(1));
htmlString = htmlString.replaceAll("(<ol start=\"" + n + "\")(.*?)(>)",
"<ol $2>" + StringUtils.repeat("\n\t<li style=\"visibility:hidden\" />", n - 1));
}
Using Jsoup you can write something like:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
class JsoupTest {
public static void main(String[] args){
String html = "<ol start=\"3\">\n" +
" <li>Element 1</li>\n" +
" <li>Element 2</li>\n" +
" <li>Element 3</li>\n" +
"</ol>"
+ "<p>some other html elements</p>"
+ "<ol start=\"5\">\n" +
" <li>Element 1</li>\n" +
" <li>Element 2</li>\n" +
" <li>Element 3</li>\n" +
" <li>Element 4</li>\n" +
" <li>Element 5</li>\n" +
"</ol>";
Document doc = Jsoup.parse(html);
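// select only lists that actually have a start attribute, so parseInt below never sees an empty string
Elements ols = doc.select("ol[start]");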
for(Element ol :ols){
int start = Integer.parseInt(ol.attr("start"));
for(int i=0; i<start-1; i++){
ol.prependElement("li").attr("style", "visibility:hidden");
}
ol.attributes().remove("start");
System.out.println(ol);
}
}
}
You can try this.
String input="<ol start=\"6\">"+
"<li>Element 1</li>"+
"<li>Element 2</li>"+
"<li>Element 3</li>"+
"<li>Element 4</li>"+
"<li>Element 5</li>"+
"<li>Element6</li>"+
"</ol>";
Matcher match= Pattern.compile("<ol .*start.*=.*\\\"(.*)\\\"\\s*>(.*)(</ol>)").matcher(input);
String resultString ="";
if(match.find()){
resultString =match.replaceAll("<ol>"+new String(new char[Integer.valueOf(match.group(1))-1]).replace("\0", "\n\t<li style=\"visibility:hidden\" />")+"$2$3");
}
Use Java's Matcher and Pattern to count the occurrences of the <li> tag, then use StringBuilder's insert method to add the invisible elements.
int count = 0; // number of <li> tags found in s, the HTML string
Matcher m = Pattern.compile("<li>").matcher(s);
while (m.find()) {
++count;
}

regex to replace to "" does not work - Java

I'm trying to replace all &nbsp; in a string with "" but the below does not seem to work.
str.replace("&nbsp","");
My string:
<img alt="" src="abc430.jpg" width="650" height="430" /> u seen hey hey hey
trying to get this output:
<img alt="" src="abc430.jpg" width="650" height="430" /> u seen hey hey hey
Now the replace code does replace &nbsp; with "", but my output on the page is still
u seen
link here
hey hey
It's not in one line
It works. You didn't assign the replaced string back to str:
str = str.replace("&nbsp;", "");
Try to run this code from your end:
String str = "alex alex";
str = str.replace(" ","");
System.out.println(str);
Outputs:
alex alex
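To make the point about assignment concrete, here is a small sketch. It assumes the string contains the literal entity text &nbsp;, and, as an extra assumption not stated in the question, it also strips the actual non-breaking space character U+00A0, which is what the entity becomes once HTML has been decoded. ReplaceNbspSketch and stripNbsp are illustrative names.

```java
public class ReplaceNbspSketch {
    // Removes both the entity text and the real non-breaking space character.
    static String stripNbsp(String s) {
        return s.replace("&nbsp;", "").replace("\u00A0", "");
    }

    public static void main(String[] args) {
        String str = "u&nbsp;seen";

        // Strings are immutable: the result of replace is discarded here,
        // so str still contains "&nbsp;".
        str.replace("&nbsp;", "");
        System.out.println(str);

        // Assigning the result back is what actually changes str.
        str = stripNbsp(str);
        System.out.println(str);
    }
}
```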

Modify large string

I have a large string in the following format -
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
I'd like to store all occurrences of the value that appears before .html. So the above HTML becomes something like 12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
Do I need a regular expression? or some kind of replace method.
Thanks
You don't actually need a regular expression, but you could use the underlying Matcher class:
final String searchString = "12345.html";
final String txt =
"<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">\n"
+ "<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">";
final Matcher matcher = Pattern.compile(searchString, Pattern.LITERAL).matcher(txt);
final StringBuilder sb = new StringBuilder();
while(matcher.find()){
if(sb.length() > 0) sb.append(',');
sb.append(matcher.group());
}
System.out.println(sb.toString());
Output:
12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
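If the href values are not all identical, a literal search no longer helps, and a capturing group may be a better fit. The following sketch assumes every link looks like href="NAME.html" with double quotes; it collects only the part before .html. HrefNamesSketch and names are made-up identifiers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefNamesSketch {
    // Group 1 is whatever sits between href=" and .html"
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]*?)\\.html\"");

    static List<String> names(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String txt = "<a href=\"12345.html\"><a href=\"678.html\">";
        // Join with commas, as in the question's desired format.
        System.out.println(String.join(",", names(txt)));
    }
}
```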
You can use an HTML parser like Jsoup.
Document doc = Jsoup.parse(yourString);
Elements els = doc.select("a");
for(Element el: els){
// Only needed if you want the number without the ".html" suffix;
// otherwise el.attr("href") alone is enough.
if(el.attr("href").contains(".html")){
// split takes a regex, so the dot must be escaped
String[] parts = el.attr("href").split("\\.html");
System.out.println(parts[0]);
}
}
Don't use regex to parse HTML.
If you are accessing this string inside Java code, you can split the string on the "=' delimiter. It will result in a bunch of strings. One string will look like "
So the steps are:
1. split the string which will result in string array.
2. Iterate over the resulting array and look for the pattern ">

How can I extract all substring by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>
Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(), but in Java you always have to spell it out.
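Since Java 9 the loop can be shortened somewhat with Matcher.results(), which streams the matches. A minimal sketch, assuming double-quoted src attributes as in the question; SrcResultsSketch and srcValues are illustrative names.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SrcResultsSketch {
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    static List<String> srcValues(String html) {
        // Matcher.results() (Java 9+) yields one MatchResult per match.
        return SRC.matcher(html)
                .results()
                .map(r -> r.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "<p>Test <img src=\"/img/1.jpg\" /> Test <img src=\"/img/2.jpg\" /></p>";
        srcValues(data).forEach(System.out::println);
    }
}
```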
Idea - split around the '"' char, look at each part if it contains the attribute name src and - if yes - store the next value, which is a src attribute.
String[] parts = thisString.split("\""); // splits at " char
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part:parts) {
if (part.trim().endsWith("src=")) {
nextIsSrcAttrib = true;
} else if (nextIsSrcAttrib) {
srcAttributes.add(part);
nextIsSrcAttrib = false;
}
}
Better idea - feed it into a usual html parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.
sorry for not coding it (short of time)
how about:
1. (Assuming that the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume that the resulting array is called strArr).
3. loop over resulting array of Strings and store strArr[i].substring(0,strArr[i].indexOf("\" />")) to some collection of image sources.
Aviad
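The steps above could be coded roughly as follows. This is a sketch with one deliberate deviation: it cuts each piece at the next double quote rather than at "\" />", on the assumption that src is not always the last attribute in the tag. Reading the file is left out, and SrcSplitSketch and srcValues are invented names.

```java
import java.util.ArrayList;
import java.util.List;

public class SrcSplitSketch {
    static List<String> srcValues(String html) {
        // Step 2: split around src=" (the text has no regex metacharacters).
        String[] strArr = html.split("src=\"");
        List<String> sources = new ArrayList<>();
        // strArr[0] is the text before the first src=", so start at index 1.
        for (int i = 1; i < strArr.length; i++) {
            // Step 3: keep everything up to the closing quote of the attribute.
            int end = strArr[i].indexOf('"');
            if (end >= 0) {
                sources.add(strArr[i].substring(0, end));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"/a/1.jpg\" />x<img src=\"/a/2.jpg\" /></p>";
        System.out.println(srcValues(html));
    }
}
```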
since you've requested a regex implementation ...
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
// [^\"]* instead of .* so the match stops at the attribute's closing quote
Pattern pattern = Pattern.compile("src=\"[^\"]*\"");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
You may have to tweak the regex if your src attributes are not double quoted
