Regex: how to substitute a string with n occurrences of a substring - java

As a premise, I have an HTML text with some <ol> elements. These have a start attribute, but the framework I'm using is not capable of interpreting it during PDF conversion. So, the trick I am trying to apply is to add a number of invisible <li> elements at the beginning of each list.
As an example, suppose this input text:
<ol start="3">
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
I want to produce this result:
<ol>
<li style="visibility:hidden"></li>
<li style="visibility:hidden"></li>
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
So, adding n-1 invisible elements into the ordered list.
But I'm not able to do that from Java in a generalized way.
Supposing the exact case in the example, I could do this (using replace, so - to be honest - without regex):
htmlString = htmlString.replace("<ol start=\"3\">",
"<ol><li style=\"visibility:hidden\"></li><li style=\"visibility:hidden\"></li>");
But, obviously, it just applies to the case with "start=3". I know that I can use groups to extract the "3", but how can I use it as a "variable" to specify the string <li style=\"visibility:hidden\"></li> n-1 number of times?
Thanks for any insight.

You cannot do this using regular expressions alone, and even if you find some hack to do it, it's going to be a suboptimal solution.
The right way to do this is to use an HTML parsing library (e.g. Jsoup) and then add the <li> tags as children to the <ol>, specifically using the Element#prepend method. (With Jsoup you can also read the start attribute value in order to compute how many elements to add)

Since Java 9, there's a Matcher.replaceAll method taking a callback function as a parameter:
String text = "<ol start=\"3\">\n\t<li>Element 1</li>\n\t<li>Element 2</li>\n\t<li>Element 3</li>\n</ol>";
String result = Pattern
.compile("<ol start=\"(\\d+)\">") // \\d+ so that multi-digit values such as start="12" also match
.matcher(text)
.replaceAll(m -> "<ol>" + repeat("\n\t<li style=\"visibility:hidden\" />",
Integer.parseInt(m.group(1))-1));
To repeat the string you can take the trick from here, or use a loop.
public static String repeat(String s, int n) {
return new String(new char[n]).replace("\0", s);
}
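On Java 11 and later, the standard library has String.repeat(int), so a hand-written helper is not strictly needed. A minimal sketch, assuming Java 11+ (the class name RepeatDemo and the method hiddenItems are made up for illustration):

```java
class RepeatDemo {
    // Builds n copies of the hidden list item.
    // n is clamped at 0 so that a start value of 0 cannot cause an exception.
    static String hiddenItems(int n) {
        return "\n\t<li style=\"visibility:hidden\" />".repeat(Math.max(0, n));
    }

    public static void main(String[] args) {
        // For start="3" we need 3 - 1 = 2 hidden items.
        System.out.println("<ol>" + hiddenItems(3 - 1));
    }
}
```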
Afterwards, result is:
<ol>
<li style="visibility:hidden" />
<li style="visibility:hidden" />
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
If you are stuck with an older version of Java, you can still match and replace in two steps.
Matcher m = Pattern.compile("<ol start=\"(\\d+)\">").matcher(text); // \\d+ also matches multi-digit start values
while (m.find()) {
int n = Integer.parseInt(m.group(1));
text = text.replace("<ol start=\"" + n + "\">",
"<ol>" + repeat("\n\t<li style=\"visibility:hidden\" />", n-1));
}
Update by Andrea ジーティーオー:
I modified the (great) solution above so that it also handles <ol> elements with multiple attributes, where the tag does not end right after start (for example, an <ol> with letters, such as <ol start="4" style="list-style-type: upper-alpha;">). This uses replaceAll so that the whole tag is matched as a regex.
// Match a tag that starts with "<ol start=", has a number, then optional further attributes, and ends with ">"
Matcher m = Pattern.compile("<ol start=\"(\\d+)\"(.*?)>").matcher(htmlString);
while (m.find()) {
int n = Integer.parseInt(m.group(1));
// $2 carries any remaining attributes (with their leading space).
// StringUtils.repeat is presumably the Apache Commons Lang helper.
htmlString = htmlString.replaceAll("(<ol start=\"" + n + "\")(.*?)(>)",
"<ol$2>" + StringUtils.repeat("\n\t<li style=\"visibility:hidden\" />", n - 1));
}
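The same idea can also be done in a single pass on Java 8 with appendReplacement, which avoids running replaceAll over the whole string once per match. This is only a sketch under the same assumptions as above (start is the first attribute and is written with double quotes); the class name OlStartRewriter and the method rewrite are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OlStartRewriter {
    // Group 1: the start number; group 2: any further attributes, including their leading space.
    private static final Pattern OL_START = Pattern.compile("<ol start=\"(\\d+)\"(.*?)>");
    private static final String HIDDEN_LI = "\n\t<li style=\"visibility:hidden\" />";

    static String rewrite(String html) {
        Matcher m = OL_START.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int n = Integer.parseInt(m.group(1));
            // Rebuild the opening tag without start, keeping the other attributes.
            StringBuilder replacement = new StringBuilder("<ol").append(m.group(2)).append('>');
            for (int i = 1; i < n; i++) {
                replacement.append(HIDDEN_LI); // n - 1 hidden items
            }
            // quoteReplacement keeps any '$' or '\' in the attributes literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<ol start=\"3\" style=\"list-style-type: upper-alpha;\">\n\t<li>Element 1</li>\n</ol>";
        System.out.println(rewrite(html));
    }
}
```

Because each match is replaced where it was found, two lists with the same start value but different attributes are handled independently.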

Using Jsoup you can write something like:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
class JsoupTest {
public static void main(String[] args){
String html = "<ol start=\"3\">\n" +
" <li>Element 1</li>\n" +
" <li>Element 2</li>\n" +
" <li>Element 3</li>\n" +
"</ol>"
+ "<p>some other html elements</p>"
+ "<ol start=\"5\">\n" +
" <li>Element 1</li>\n" +
" <li>Element 2</li>\n" +
" <li>Element 3</li>\n" +
" <li>Element 4</li>\n" +
" <li>Element 5</li>\n" +
"</ol>";
Document doc = Jsoup.parse(html);
Elements ols = doc.select("ol[start]"); // only lists with a start attribute; parseInt("") would throw for the others
for(Element ol :ols){
int start = Integer.parseInt(ol.attr("start"));
for(int i=0; i<start-1; i++){
ol.prependElement("li").attr("style", "visibility:hidden");
}
ol.attributes().remove("start");
System.out.println(ol);
}
}
}

You can try this.
String input="<ol start=\"6\">"+
"<li>Element 1</li>"+
"<li>Element 2</li>"+
"<li>Element 3</li>"+
"<li>Element 4</li>"+
"<li>Element 5</li>"+
"<li>Element6</li>"+
"</ol>";
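// Only digits are captured for the start value, and the list body is matched lazily.
Matcher match = Pattern.compile("<ol\\s+start\\s*=\\s*\"(\\d+)\"\\s*>(.*?)(</ol>)", Pattern.DOTALL).matcher(input);
String resultString = "";
if (match.find()) {
// Note: the count taken from the first match is reused for every <ol> that replaceAll rewrites.
resultString = match.replaceAll("<ol>" + new String(new char[Integer.parseInt(match.group(1)) - 1]).replace("\0", "\n\t<li style=\"visibility:hidden\" />") + "$2$3");
}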

Use Java's Matcher and Pattern to count the occurrences of the <li> tag, and use the StringBuilder insert method to insert the invisible elements.
int count = 0;
Matcher m = Pattern.compile("<li>").matcher(s);
while (m.find()) {
++count;
}

Related

ReplaceAll Regex: Update group before Replacing

I'm using a regex to extract some values into groups and put those values into another section of my new string, but I need to make another change to a captured group before replaceAll executes. I have this code:
String regex = "<button data-key=\"([^\"]*)([^<]*)</button>";
while ((strLine = br.readLine()) != null) {
String newStr = strLine.replaceAll(regex, "<button data-key=\"$1$2<span>&#x$1</span></button>");
}
This works OK for extracting the data-key value (group 1 / $1) into the span tag when the value is simple (no "-" in it), but when data-key is, e.g., 1f1e8-1f1e6, the value is extracted like this: &#x1f1e8-1f1e6. So I was wondering whether it's possible to do something like this:
String newStr =
strLine.replaceAll(regex,
"<button data-key=\"$1$2<span>&#x" + "$1".replaceAll("-", "&#x") + "</span></button>");
That is, replace "-" with "&#x" in $1 inside the replaceAll call, but no success so far. Do I need to switch to Matchers? Any help on the best approach for this scenario would be appreciated, thanks.
Edit1:
CURRENT:
<button data-key="1f1e8-1f1e8-1f1e8"><span>&#x1f1e8-1f1e8-1f1e8</span></button><button data-key="1f1e8-1f1e9"><span>&#x1f1e8-1f1e9</span></button>
EXPECTED:
<button data-key="1f1e8-1f1e8-1f1e8"><span>&#x1f1e8&#x1f1e8&#x1f1e8</span></button><button data-key="1f1e8-1f1e9"><span>&#x1f1e8&#x1f1e9</span></button>
Edit2:
INPUT:
<button data-key="1f1e8-1f1e8-1f1e8"></button>
<button data-key="1f1e8-1f1e9"></button>
Edit3:
WHOLE INPUT:
<div>
<h3>GG</h3>
<div class="ep-categoryItems">
<button class="ep-item" data-key="1f1e8-1f1e8-1f1e8" title="Grinning face" style="background-image: url('${cdn}/images/emoji/f1e8-1f1e8-1f1e8.png');"></button>
<button class="ep-item" data-key="1f1e8-1f1e9" title="Grinning face" style="background-image: url('${cdn}/images/emoji/1f1e8-1f1e9.png');"></button>
</div>
<div
UPDATE: Changed to work in Java 8 and with new input.
Also fixed to add the missing ;
It can be done like this:
String input = "<button data-key=\"1f1e8-1f1e8-1f1e8\"></button><button data-key=\"1f1e8-1f1e9\"> TO BE REPLACED </button>";
String regex = "(<button data-key=\"([^\"]+)\">).*?</button>";
StringBuffer buf = new StringBuffer();
Matcher m = Pattern.compile(regex).matcher(input);
while (m.find())
m.appendReplacement(buf, m.group(1) + "<span>" + m.group(2).replaceAll("-?([0-9a-fA-F]+)", "&#x$1;") + "</span></button>");
String output = m.appendTail(buf).toString();
System.out.println(input);
System.out.println(output);
Output
<button data-key="1f1e8-1f1e8-1f1e8"></button><button data-key="1f1e8-1f1e9"> TO BE REPLACED </button>
<button data-key="1f1e8-1f1e8-1f1e8"><span>&#x1f1e8;&#x1f1e8;&#x1f1e8;</span></button><button data-key="1f1e8-1f1e9"><span>&#x1f1e8;&#x1f1e9;</span></button>
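If Java 9 or later is available, the same replacement can be written with Matcher.replaceAll and a function instead of the appendReplacement loop. The following is a sketch only; it keeps the regex from the answer above, and the class name DataKeyToEntities and the method convert are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DataKeyToEntities {
    // Group 1: the opening button tag; group 2: the data-key value.
    private static final Pattern BUTTON =
            Pattern.compile("(<button data-key=\"([^\"]+)\">).*?</button>");

    static String convert(String input) {
        return BUTTON.matcher(input).replaceAll(m -> {
            // Turn "1f1e8-1f1e9" into "&#x1f1e8;&#x1f1e9;".
            String entities = "&#x" + m.group(2).replace("-", ";&#x") + ";";
            // quoteReplacement guards against '$' or '\' appearing in the matched text.
            return Matcher.quoteReplacement(m.group(1) + "<span>" + entities + "</span></button>");
        });
    }

    public static void main(String[] args) {
        System.out.println(convert("<button data-key=\"1f1e8-1f1e9\"></button>"));
    }
}
```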

How to grab HTML tags as well as the text between them and store in an object

I am working on a selenium-appium-java mobile web automation framework. I have a Cucumber test that uses a regex to accept some text and pass it on as a parameter, such as:
@Given("^user checks text \"([^\"]*)\" in footer$")
public void checkFooter(String footerText) {
footerComponent.checkNote(footerText);
}
Here is how it is currently set up to find the basic text of a node in the FooterComponent class:
private final String FOOTER = "//div[contains(@class, 'footer')]";
public void checkNote(String expectedText) {
By note = By.xpath(FOOTER + "//div[@class='footer-footnote']");
String actualText = getDriver().findElement(note).getText();
assertEquals(actualText, expectedText, "Unexpected footer note");
}
Example of DOM that i need to validate expected result against:
<div class='footer'>
text1
<span class="copysymbol"></span>
text2
<span class="dot"></span>
text3
<span class="dot"></span>
text4
<span class="dot"></span>
</div>
I've tried using the pattern from here, but I've not been successful:
https://alvinalexander.com/blog/post/java/how-extract-html-tag-string-regex-pattern-matcher-group
So basically I need to pass in some text, in the Cucumber line, that checks for the presence of the tags (which represent special characters I need to check for) AND the text between them, and then have the Java method check against the actual markup by finding it with XPath. Is there a way this can be done using regex via Cucumber?
I'd like to provide a stub as an answer, since I have to agree that XPath is more appropriate than regex here.
Plus, if someone here gives you a complicated regex that does everything you want but is not maintainable by you.. what have you gained?
The following pattern matches the whole footer div. I cannot do much more since your description contains only one example and no variations.
<div class='footer'>.*?<span class="copysymbol"><\/span>.*?<span class="dot"><\/span>.*?<span class="dot"><\/span>.*?<span class="dot"><\/span>\s*<\/div>
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "<div class='footer'>.*?<span class=\"copysymbol\"><\\/span>.*?<span class=\"dot\"><\\/span>.*?<span class=\"dot\"><\\/span>.*?<span class=\"dot\"><\\/span>\\s*<\\/div>";
final String string = "<div class='footer'>\n"
+ "text1\n"
+ "<span class=\"copysymbol\"></span>\n"
+ "text2\n"
+ "<span class=\"dot\"></span>\n"
+ "text3\n"
+ "<span class=\"dot\"></span>\n"
+ "text4\n"
+ "<span class=\"dot\"></span>\n"
+ "</div>";
final Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
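Note that a pattern without capture groups has a groupCount() of 0, so a loop over groups has nothing to print. If the text between the tags is what needs to be compared, one option is to add a lazy capture group before each span. This is only a sketch for the single DOM example given in the question; the class name FooterTexts and the method texts are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FooterTexts {
    // One lazy group per text run; \\s* on both sides trims the surrounding whitespace.
    private static final Pattern FOOTER = Pattern.compile(
            "<div class='footer'>\\s*(.*?)\\s*<span class=\"copysymbol\"></span>"
            + "\\s*(.*?)\\s*<span class=\"dot\"></span>"
            + "\\s*(.*?)\\s*<span class=\"dot\"></span>"
            + "\\s*(.*?)\\s*<span class=\"dot\"></span>\\s*</div>",
            Pattern.DOTALL);

    // Returns the four text runs, or an empty list if the footer does not have the expected shape.
    static List<String> texts(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = FOOTER.matcher(html);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class='footer'>\ntext1\n<span class=\"copysymbol\"></span>\ntext2\n"
                + "<span class=\"dot\"></span>\ntext3\n<span class=\"dot\"></span>\ntext4\n"
                + "<span class=\"dot\"></span>\n</div>";
        System.out.println(texts(html));
    }
}
```

An empty result then means either a tag is missing or the tags are in a different order, which is the presence check the question asks about.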

How to find the html element of a given text

Assume I have the following code to be parsed using JSoup
<body>
<div id="myDiv" class="simple" >
<p>
<img class="alignleft" src="myimage.jpg" alt="myimage" />
I just passed out of UC Berkeley
</p>
</div>
</body>
The question is, given just a keyword "Berkeley", is there a better way to find the element/XPath (or a list of it, if multiple occurrences of the keyword is present) in the html, which has this keyword as part of its text.
I don't get to see the html before hand, and will be available only at runtime.
My current implementation - Using Java-Jsoup, iterate through the children of body, and get "ownText" and text of each children, and then drill down into their children to narrow down the html element. I feel this is very slow.
A not very elegant but simple way could look like this:
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String argv[]) {
String html = "<body> \n" +
" <div id=\"myDiv\" class=\"simple\" >\n" +
" <p>\n" +
" <img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />\n" +
" I just passed out of UC Berkeley\n" +
" </p>\n" +
" <ol>\n" +
" <li>Berkeley</li>\n" +
" <li>Berkeley</li>\n" +
" </ol>\n" +
" </div> \n" +
"</body>";
Elements eles = Jsoup.parse(html).getAllElements(); // get all elements which appear in your html
Set<String> set = new HashSet<>();
for(Element e : eles){
Tag t = e.tag();
set.add(t.getName()); // put the tag name in a set or list
}
set.remove("head"); set.remove("html"); set.remove("body"); set.remove("#root"); set.remove("img"); //remove some unimportant tags
for(String s : set){
System.out.println(s);
if(!Jsoup.parse(html).select(s+":contains(Berkeley)").isEmpty()){ // check if the tag contains your key word
System.out.println(Jsoup.parse(html).select(s+":contains(Berkeley)").get(0).toString());} // print it out or do something else
System.out.println("---------------------");
System.out.println();
}
}
}
Try this XPath:
For the first element with a class:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@class]'
For the first element with an id:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@id]'
Check normalize-space

Modify large string

I have a large string in the following format -
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
<a href="12345.html"><a href="12345.html"><a href="12345.html"><a href="12345.html">
I'd like to store all occurrences of the value that occurs before .html. So the above HTML becomes something like 12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
Do I need a regular expression? or some kind of replace method.
Thanks
You don't actually need a regular expression, but you could use the underlying Matcher class:
final String searchString = "12345.html";
final String txt =
"<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">\n"
+ "<a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\"><a href=\"12345.html\">";
final Matcher matcher = Pattern.compile(searchString, Pattern.LITERAL).matcher(txt);
final StringBuilder sb = new StringBuilder();
while(matcher.find()){
if(sb.length() > 0) sb.append(',');
sb.append(matcher.group());
}
System.out.println(sb.toString());
Output:
12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html,12345.html
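If the href values are not all identical, a literal search string is not enough. A hedged sketch of one way to collect every value ending in .html with a capture group follows; it assumes double-quoted href attributes, and the class name HrefCollector and the method collect are invented for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefCollector {
    // Group 1: the attribute value, which must end in ".html".
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+\\.html)\"");

    static String collect(String txt) {
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(txt);
        while (m.find()) {
            found.add(m.group(1));
        }
        // String.join puts the commas between the entries.
        return String.join(",", found);
    }

    public static void main(String[] args) {
        System.out.println(collect("<a href=\"12345.html\"><a href=\"678.html\">"));
    }
}
```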
You can use an HTML parser like Jsoup.
Document doc = Jsoup.parse(yourString);
Elements els = doc.select("a");
for(Element el: els){
// split only if you need the number without the ".html" suffix;
// otherwise el.attr("href") is enough
if(el.attr("href").contains(".html")){
String[] parts = el.attr("href").split("\\.html"); // split takes a regex, so the dot is escaped
System.out.println(parts[0]);
}
}
Don't use regex to parse HTML.
If you are accessing this string inside Java code, you can split the string on the "=' delimiter. This results in a bunch of strings. One string will look like "
So the steps are:
1. split the string which will result in string array.
2. Iterate over the resulting array and look for the pattern ">

How can I extract all substring by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>
Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(), but in Java you always have to spell it out.
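The explicit loop is needed on Java 8 and earlier. From Java 9 on, Matcher.results() returns a stream of MatchResult objects, which gets close to a one-line "grab all matches". A minimal sketch using the same pattern (the class name SrcExtractor and the method srcValues are made up for illustration):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SrcExtractor {
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    // Collects group 1 of every match, in document order.
    static List<String> srcValues(String html) {
        return SRC.matcher(html)
                .results()
                .map(r -> r.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(srcValues("<img src=\"/a/1.jpg\" /><img src=\"/a/2.jpg\" />"));
    }
}
```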
Idea - split around the '"' char, look at each part if it contains the attribute name src and - if yes - store the next value, which is a src attribute.
String[] parts = thisString.split("\""); // splits at " char
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part:parts) {
if (part.trim().endsWith("src=")) {
nextIsSrcAttrib = true;
} else if (nextIsSrcAttrib) {
srcAttributes.add(part);
nextIsSrcAttrib = false;
}
}
Better idea - feed it into a usual html parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.
sorry for not coding it (short of time)
how about:
1. (Assuming that the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume that the resulting array is called strArr).
3. Loop over the resulting array of Strings and store strArr[i].substring(0,strArr[i].indexOf("\" />")) in some collection of image sources.
Aviad
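The split-based steps above could be sketched roughly as follows. This assumes the HTML is already in a String (step 1 is left out), and it swaps in a small change: the value is cut at the next double quote instead of at "\" />", so that src does not have to be the last attribute. The class name SrcSplit and the method srcValues are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class SrcSplit {
    static List<String> srcValues(String html) {
        List<String> sources = new ArrayList<>();
        // Step 2: split around src=" (split takes a regex; this one has no special characters).
        String[] strArr = html.split("src=\"");
        // Step 3: strArr[0] is whatever precedes the first src attribute, so start at index 1.
        for (int i = 1; i < strArr.length; i++) {
            int end = strArr[i].indexOf('"'); // closing quote of the attribute value
            if (end >= 0) {
                sources.add(strArr[i].substring(0, end));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        System.out.println(srcValues("<img alt=\"70\" src=\"/a/1.jpg\" />Test<img src=\"/a/2.jpg\" />"));
    }
}
```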
since you've requested a regex implementation ...
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
Pattern pattern = Pattern.compile("src=\"[^\"]*\""); // [^\"]* stops at the closing quote; a greedy .* would run on to the last quote on the line
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
You may have to tweak the regex if your src attributes are not double quoted
