How to find the html element of a given text - java

Assume I have the following code to be parsed using JSoup
<body>
<div id="myDiv" class="simple" >
<p>
<img class="alignleft" src="myimage.jpg" alt="myimage" />
I just passed out of UC Berkeley
</p>
</div>
</body>
Given just a keyword such as "Berkeley", is there a better way to find the element or XPath (or a list of them, if the keyword occurs more than once) whose text contains that keyword?
I don't get to see the HTML beforehand; it is only available at runtime.
My current implementation uses Java with Jsoup: iterate through the children of body, get the "ownText" and text of each child, and then drill down into their children to narrow down the HTML element. I feel this is very slow.

A not very elegant but simple way could look like this:
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String argv[]) {
String html = "<body> \n" +
" <div id=\"myDiv\" class=\"simple\" >\n" +
" <p>\n" +
" <img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />\n" +
" I just passed out of UC Berkeley\n" +
" </p>\n" +
" <ol>\n" +
" <li>Berkeley</li>\n" +
" <li>Berkeley</li>\n" +
" </ol>\n" +
" </div> \n" +
"</body>";
Elements eles = Jsoup.parse(html).getAllElements(); // get all elements which appear in your html
Set<String> set = new HashSet<>();
for(Element e : eles){
Tag t = e.tag();
set.add(t.getName()); // put the tag name in a set or list
}
set.remove("head"); set.remove("html"); set.remove("body"); set.remove("#root"); set.remove("img"); //remove some unimportant tags
for(String s : set){
System.out.println(s);
if(!Jsoup.parse(html).select(s+":contains(Berkeley)").isEmpty()){ // check if the tag contains your key word
System.out.println(Jsoup.parse(html).select(s+":contains(Berkeley)").get(0).toString());} // print it out or do something else
System.out.println("---------------------");
System.out.println();
}
}
}

Try this XPath.
For the first element with a class:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@class]'
For the first element with an id:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@id]'
See the XPath normalize-space() function.
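To try an XPath like this without any extra library, here is a minimal sketch using the XPath engine that ships with the JDK. It assumes the markup is well-formed XML (the JDK parser is not a lenient HTML parser like Jsoup) and that the keyword contains no double quotes. The class name KeywordXPath and the method findTags are made up for illustration. The expression used here matches only elements whose own text nodes contain the keyword, which gives the innermost elements rather than every ancestor.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class KeywordXPath {

    // Returns the tag names of all elements that have a direct text node
    // containing the keyword, in document order.
    static List<String> findTags(String xhtml, String keyword) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Assumption: keyword contains no double quote characters.
            String expr = "//*[text()[contains(normalize-space(.), \"" + keyword + "\")]]";
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> tags = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                tags.add(hits.item(i).getNodeName());
            }
            return tags;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body><div id=\"myDiv\" class=\"simple\"><p>"
                + "<img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />"
                + "I just passed out of UC Berkeley</p></div></body>";
        System.out.println(findTags(xhtml, "Berkeley")); // prints [p]
    }
}
```

For real-world HTML that is not well-formed, parsing with Jsoup first is probably the safer route.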

Regex: how to substitute a string with n occurrences of a substring

As a premise, I have an HTML text with some <ol> elements. These have a start attribute, but the framework I'm using is not capable of interpreting it during PDF conversion. So, the trick I am trying to apply is to add a number of invisible <li> elements at the beginning.
As an example, suppose this input text:
<ol start="3">
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
I want to produce this result:
<ol>
<li style="visibility:hidden"></li>
<li style="visibility:hidden"></li>
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
So, adding n-1 invisible elements into the ordered list.
But I'm not able to do that from Java in a generalized way.
Supposing the exact case in the example, I could do this (using replace, so - to be honest - without regex):
htmlString = htmlString.replace("<ol start=\"3\">",
"<ol><li style=\"visibility:hidden\"></li><li style=\"visibility:hidden\"></li>");
But, obviously, it just applies to the case with "start=3". I know that I can use groups to extract the "3", but how can I use it as a "variable" to specify the string <li style=\"visibility:hidden\"></li> n-1 number of times?
Thanks for any insight.
You cannot do this using regular expressions, or even if you find some hack to do this, it's going to be a suboptimal solution.
The right way to do this is to use an HTML parsing library (e.g. Jsoup) and then add the <li> tags as children of the <ol>, specifically using the Element#prepend method. (With Jsoup you can also read the start attribute value to compute how many elements to add.)
Since Java 9, there's a Matcher.replaceAll method taking a callback function as a parameter:
String text = "<ol start=\"3\">\n\t<li>Element 1</li>\n\t<li>Element 2</li>\n\t<li>Element 3</li>\n</ol>";
String result = Pattern
.compile("<ol start=\"(\\d+)\">") // \\d+ so that multi-digit start values also match
.matcher(text)
.replaceAll(m -> "<ol>" + repeat("\n\t<li style=\"visibility:hidden\" />",
Integer.parseInt(m.group(1))-1));
To repeat the string you can take the trick from here, or use a loop.
public static String repeat(String s, int n) {
return new String(new char[n]).replace("\0", s);
}
Afterwards, result is:
<ol>
<li style="visibility:hidden" />
<li style="visibility:hidden" />
<li>Element 1</li>
<li>Element 2</li>
<li>Element 3</li>
</ol>
If you are stuck with an older version of Java, you can still match and replace in two steps.
Matcher m = Pattern.compile("<ol start=\"(\\d+)\">").matcher(text);
while (m.find()) {
int n = Integer.parseInt(m.group(1));
text = text.replace("<ol start=\"" + n + "\">",
"<ol>" + repeat("\n\t<li style=\"visibility:hidden\" />", n-1));
}
Update by Andrea ジーティーオー:
I modified the (great) solution above to also handle <ol> elements that have multiple attributes, so that the tag does not end right after start (for example, an <ol> with letters, such as <ol start="4" style="list-style-type: upper-alpha;">). This uses replaceAll to treat the whole tag as a regex.
// Take something that starts with "<ol start=", ends with ">", and has a number in between.
// StringUtils below is from Apache Commons Lang.
Matcher m = Pattern.compile("<ol start=\"(\\d+)\"(.*?)>").matcher(htmlString);
while (m.find()) {
int n = Integer.parseInt(m.group(1));
htmlString = htmlString.replaceAll("(<ol start=\"" + n + "\")(.*?)(>)",
"<ol $2>" + StringUtils.repeat("\n\t<li style=\"visibility:hidden\" />", n - 1));
}
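The two ideas above (the Java 9+ replaceAll callback and support for extra attributes) can be combined into a single pass. This is only a sketch under some assumptions: start must be the first attribute of the tag, the hidden items use the <li style="visibility:hidden"></li> form from the question, and the names OlStartPadding and padLists are made up for illustration.

```java
import java.util.Collections;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OlStartPadding {

    // Group 1: the start number (one or more digits).
    // Group 2: any remaining attributes, including their leading space.
    private static final Pattern OL_START =
            Pattern.compile("<ol start=\"(\\d+)\"([^>]*)>");

    private static final String HIDDEN_LI = "<li style=\"visibility:hidden\"></li>";

    // Replaces every <ol start="n" ...> with <ol ...> followed by n-1 hidden items.
    static String padLists(String html) {
        return OL_START.matcher(html).replaceAll(m -> {
            int n = Integer.parseInt(m.group(1));
            String padding = String.join("",
                    Collections.nCopies(Math.max(0, n - 1), HIDDEN_LI));
            // quoteReplacement keeps '$' and '\' in the attributes from being
            // interpreted as group references in the replacement string.
            return Matcher.quoteReplacement("<ol" + m.group(2) + ">" + padding);
        });
    }

    public static void main(String[] args) {
        String html = "<ol start=\"3\"><li>Element 1</li></ol>"
                + "<ol start=\"12\" style=\"list-style-type: upper-alpha;\"><li>L</li></ol>";
        System.out.println(padLists(html));
    }
}
```

Because the callback runs once per match, there is no need to loop and re-scan the string for each distinct start value.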
Using Jsoup you can write something like:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
class JsoupTest {
public static void main(String[] args){
String html = "<ol start=\"3\">\n" +
" <li>Element 1</li>\n" +
" <li>Element 2</li>\n" +
" <li>Element 3</li>\n" +
"</ol>"
+ "<p>some other html elements</p>"
+ "<ol start=\"5\">\n" +
" <li>Element 1</li>\n" +
" <li>Element 2</li>\n" +
" <li>Element 3</li>\n" +
" <li>Element 4</li>\n" +
" <li>Element 5</li>\n" +
"</ol>";
Document doc = Jsoup.parse(html);
Elements ols = doc.select("ol[start]"); // only lists that have a start attribute, so parseInt below cannot fail on an empty string
for(Element ol :ols){
int start = Integer.parseInt(ol.attr("start"));
for(int i=0; i<start-1; i++){
ol.prependElement("li").attr("style", "visibility:hidden");
}
ol.attributes().remove("start");
System.out.println(ol);
}
}
}
You can try this.
String input="<ol start=\"6\">"+
"<li>Element 1</li>"+
"<li>Element 2</li>"+
"<li>Element 3</li>"+
"<li>Element 4</li>"+
"<li>Element 5</li>"+
"<li>Element6</li>"+
"</ol>";
Matcher match= Pattern.compile("<ol .*start.*=.*\\\"(.*)\\\"\\s*>(.*)(</ol>)").matcher(input);
String resultString ="";
if(match.find()){
resultString =match.replaceAll("<ol>"+new String(new char[Integer.valueOf(match.group(1))-1]).replace("\0", "\n\t<li style=\"visibility:hidden\" />")+"$2$3");
}
Use Java's Matcher and Pattern to count the occurrences of the <li> tag, then use StringBuilder's insert method to add the invisible elements.
int count = 0; // number of <li> tags found in the input string s
Matcher m = Pattern.compile("<li>").matcher(s);
while(m.find()){
++count;
}

How to get anchor tag href and anchor tag text inside a div using Selenium in Java

My HTML code consists of multiple divs. Inside each div is a list of anchor tags. I need to fetch the href values and text values of the anchor tags that are in the sub-container div. I'm using Selenium to get the HTML code of the webpage.
HTML code:
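<body>
<div id="main-container">
<a href="www.one.com">One</a>
<a href="www.two.com">Two</a>
<a href="www.three.com">Three</a>
<div id="sub-container">
<a href="www.abc.com">Abc</a>
<a href="www.xyz.com">Xyz</a>
<a href="www.pqr.com">Pqr</a>
</div>
</div>
</body>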
Java code:
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement element : list) {
String link = element.getAttribute("href");
System.out.println(element.getTagName() + "=" + link);
}
Output:
a=www.one.com
a=www.two.com
a=www.three.com
a=www.abc.com
a=www.xyz.com
a=www.pqr.com
Output I need:
a=www.abc.com , Abc
a=www.xyz.com , Xyz
a=www.pqr.com , Pqr
Try this:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/*[@href]"));
for (WebElement element : list) {
String link = element.getAttribute("href");
System.out.println(element.getTagName() + "=" + link +", "+ element.getText());
}
You can use element.getText() to get the link text.
If you only want to select the links in the sub-container, you can adjust your XPath:
//*[@id="sub-container"]/a
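If you want to check the XPath itself without starting a browser, a rough sketch with the JDK's built-in XPath engine (not Selenium) could look like this. It assumes the page is well-formed XML and that the links look like <a href="www.abc.com">Abc</a>, which is inferred from the expected output in the question. The class name SubContainerLinks and the method links are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SubContainerLinks {

    // Returns "href , text" for every <a> that is a direct child of div#sub-container.
    static List<String> links(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList anchors = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//div[@id='sub-container']/a", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                result.add(a.getAttribute("href") + " , " + a.getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the markup", e);
        }
    }

    public static void main(String[] args) {
        // Assumed markup, reduced to one outer link and two inner links.
        String xhtml = "<body><div id=\"main-container\">"
                + "<a href=\"www.one.com\">One</a>"
                + "<div id=\"sub-container\">"
                + "<a href=\"www.abc.com\">Abc</a>"
                + "<a href=\"www.xyz.com\">Xyz</a>"
                + "</div></div></body>";
        for (String line : links(xhtml)) {
            System.out.println("a=" + line);
        }
    }
}
```

The same expression can then be passed to By.xpath in the Selenium code, with getText() playing the role of getTextContent().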
Pretty simple, try as below:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/a"));
for (WebElement element : list) {
String link = element.getAttribute("href");
String text = element.getText();
System.out.println(element.getTagName() + "=" + link + ", " + text);
}
If the id sub-container is unique, you can simply use the line below:
driver.findElements(By.cssSelector("div#sub-container>a"));
thanks

JSoup search by attribute and class

You can do:
Elements links = doc.select("a[href]");
to find all "a" elements with an href attribute.
And you can do:
doc.getElementsByClass("title")
to get all elements with a class that is called "title"
But how can I do both? (I.e., search for an "a" element with an "href" attribute that also has the class "title".)
You can simply have
Elements links = doc.select("a[href].title");
This will select all <a> elements that have an href attribute and the class title. A class is selected by prefixing its name with a dot:
Selector combinations
Any combination, e.g. a[href].highlight
Full example:
public static void main(String[] args) {
Document doc = Jsoup.parse(""
+ "<div>"
+ " <a href='link1' class='title another'>Link 1</a>"
+ " <a href='link2' class='another'>Link 2</a>"
+ " <a href='link3'>Link 3</a>"
+ "</div>");
Elements links = doc.select("a[href].title");
System.out.println(links); // prints only the <a> element for "Link 1"
}

Update a tag name along with its value

I am trying to replace HTML tags with updated values. I tried using Jsoup but have not worked out a way yet.
The functionality:
if (webText.contains("a href")) {
// Parse it into jsoup
Document doc = Jsoup.parse(webText);
// Create an array to tackle every type individually, as wrap can
// otherwise affect whole body types.
Element[] array = new Element[doc.select("a").size()];
for (int i = 0; i < doc.select("a").size(); i++) {
if (doc.select("a").get(i) != null) {
array[i] = doc.select("a").get(i);
}
}
for (int i = 0; i < array.length; i++) {
if (array[i].toString().contains("http")) {
Log.e("Link", array[i].toString());
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(array[i].toString());
String url = null;
if (m.find()) {
url = m.group(1); // this variable should contain the link URL
Log.e("Link Value", url);
array[i] = array[i].wrap("<a href='"+url+"' class='link'></a>");
}
}
else {
Log.e("Favourite", array[i].toString());
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(array[i].toString());
String url = null;
if (m.find()) {
url = m.group(1); // this variable should contain the link URL
Log.e("Favourite Value", url);
array[i] = array[i].wrap("<a href='"+url+"' class='favourite'></a>");
//array[i] = array[i].replaceWithreplaceWith("","");
}
}
}
Element element = doc.body();
Log.e("From element html *************** ", " " + element.html());
String currentHtml = wrapImgWithCenter(element.html());
Log.e("currentHtml", currentHtml);
listOfElements = currentHtml;
}
The line array[i] = array[i].wrap("<a href='"+url+"' class='favourite'></a>"); basically wraps the existing tags with the new value. But I do not want that to happen. I want to replace the tags completely with something like:
"<a href='"+url+"' class='favourite'>"+url+"</a>";
Input:
<html>
<head></head>
<body>
<p dir="ltr"><font color="#009a49">Frank Frank</font> <font color="#0033cc">http://yahoo.co.in</font></p>
<br />
<br />
</body>
</html>
Expected output:
<html>
<head></head>
<body>
<p dir="ltr"><font color="#009a49">Frank Frank</font> <font color="#0033cc">http://yahoo.co.in</font></p>
<br />
<br />
</body>
</html>
I have tried using replaceWith but was unsuccessful. You can still find it commented out in the source code above. Where am I going wrong? What should I do to update the tags?
P.S.: The input might be variable with some more or less tags.
You can use the replaceWith method of the Element class. I've cleaned up your code a little: I removed the arrays and used the provided lists wherever possible. Moreover, you don't need a regex to get the href attribute (or any other attribute, for that matter) once you've already parsed the HTML. Check it out and let me know if you need further assistance.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) throws Exception {
String webText =
"<html>" +
"<head></head>" +
"<body>" +
"<p dir=\"ltr\">" +
"" +
"<font color=\"#009a49\">Frank Frank</font>" +
"" +
"<font color=\"#0033cc\">http://yahoo.co.in</font>" +
"</p>" +
"</body>" +
"</html>";
if (webText.contains("a href")) {
// Parse it into jsoup
Document doc = Jsoup.parse(webText);
Elements links = doc.select("a");
for (Element link : links) {
if (link.attr("href").contains("http")) {
System.out.println("Link: " + link.toString());
String url = link.attr("href");
if (url != null) {
System.out.println("Link Value: " + url);
Attributes attributes = new Attributes();
attributes.put("href", url);
attributes.put("class", "link");
link.replaceWith(new Element(Tag.valueOf("a"), "", attributes).insertChildren(0, link.childNodes()));
}
} else {
System.out.println("Favourite: " + link.toString());
String url = link.attr("href");
if (url != null) {
System.out.println("Favourite Value: " + url);
Attributes attributes = new Attributes();
attributes.put("href", url);
attributes.put("class", "favourite");
link.replaceWith(new Element(Tag.valueOf("a"), "", attributes).insertChildren(0, link.childNodes()));
}
}
}
Element element = doc.body();
System.out.println("From element html *************** "+ element.html());
}
}
}
Input
<p dir="ltr">
<font color="#009a49">Frank Frank</font>
<font color="#0033cc">http://yahoo.co.in</font>
</p>
Output
<p dir="ltr">
<font color="#009a49">Frank Frank</font>
<font color="#0033cc">http://yahoo.co.in</font>
</p>

Modifying HTML in Memory with JSoup

Recently I was recommended to use JSoup to parse and modify HTML documents.
However, what if I have an HTML document that I want to modify (to send, store somewhere else, etc.)? How might I go about doing that without changing the original document?
Say I have an HTML file like so:
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: </p>
<p>Address: </p>
<p>Phone Number: </p>
</body>
</html>
And I want to fill in the appropriate data for Name, Address, Phone Number and any other information I'd like, without modifying the original HTML file, how might I go about that using JSoup?
A possibly simpler solution is to modify your template to include placeholders, like this:
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: <span id="name"></span></p>
<p>Address: <span id="address"></span></p>
<p>Phone Number: <span id="phone"></span></p>
</body>
</html>
Then load your document this way:
Document doc = Jsoup.parse("" +
"<html>\n" +
" <head></head>\n" +
" <body> \n" +
" <p></p>\n" +
" <h2>Title: title</h2>\n" +
" <p></p>\n" +
" <p>Name: <span id=\"name\"></span></p>\n" +
" <p>Address: <span id=\"address\"></span></p>\n" +
" <p>Phone Number: <span id=\"phone\"></span></p>\n" +
" </body>\n" +
"</html>");
doc.getElementById("name").text("Andrey");
doc.getElementById("address").text("Stackoverflow.com");
doc.getElementById("phone").text("secret!");
System.out.println(doc.html());
And this would give the form filled out.
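If you would rather not pull in a parser for such a fixed template, the same placeholder idea can be sketched with plain Java (9 or later) and a regex. This is not Jsoup, just an illustration under the assumption that the placeholders are written exactly as <span id="..."></span>. The class name TemplateFill and the methods fill and escape are made up. Since Java strings are immutable, the original template string is left untouched and the filled copy only exists in memory.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateFill {

    // Matches an empty placeholder such as <span id="name"></span>; group 1 is the id.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("<span id=\"([^\"]+)\"></span>");

    // Returns a filled copy of the template; ids without a value stay empty.
    static String fill(String template, Map<String, String> values) {
        return PLACEHOLDER.matcher(template).replaceAll(m -> {
            String id = m.group(1);
            String value = escape(values.getOrDefault(id, ""));
            return Matcher.quoteReplacement(
                    "<span id=\"" + id + "\">" + value + "</span>");
        });
    }

    // Minimal escaping so that inserted values cannot break the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String template = "<p>Name: <span id=\"name\"></span></p>\n"
                + "<p>Address: <span id=\"address\"></span></p>\n"
                + "<p>Phone Number: <span id=\"phone\"></span></p>";
        String filled = fill(template, Map.of(
                "name", "Andrey",
                "address", "Stackoverflow.com",
                "phone", "secret!"));
        System.out.println(filled);   // the filled copy
        System.out.println(template); // the template itself is unchanged
    }
}
```

For anything beyond a fixed template, the Jsoup version above is likely more robust, because it does not depend on the exact spelling of the placeholder tags.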
@MarcoS had an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses a NodeTraversor to build a list of nodes to change. I only slightly modified his method, which replaces a node (a set of tags) with the data in the node plus whatever information you would like to add.
To keep the HTML in memory, I used a static StringBuilder.
First we read in the HTML file (the path is specified manually; this can be changed), then we make a series of checks to update the nodes with whatever data we want.
The one problem I didn't fix in MarcoS's solution is that it splits the text into individual words instead of looking at the whole line. I worked around this by using '-' to join multi-word labels, because otherwise it places the string directly after that word.
So a full implementation:
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;
public class memoryHTML
{
static String htmlLocation = "C:\\Users\\User\\";
static String fileName = "blah"; // Just for demonstration, easily modified.
static StringBuilder buildTmpHTML = new StringBuilder();
static StringBuilder buildHTML = new StringBuilder();
static String name = "John Doe";
static String address = "42 University Dr., Somewhere, Someplace";
static String phoneNumber = "(123) 456-7890";
public static void main(String[] args)
{
// You can send it the full path with the filename. I split them up because I used this for multiple files.
readHTML(htmlLocation, fileName);
modifyHTML();
System.out.println(buildHTML.toString());
// You need to clear the StringBuilder Object or it will remain in memory and build on each run.
buildTmpHTML.setLength(0);
buildHTML.setLength(0);
System.exit(0);
}
// Simply parse and build a StringBuilder for a temporary HTML file that will be modified in modifyHTML()
public static void readHTML(String directory, String fileName)
{
try
{
BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html"));
String line;
while((line = br.readLine()) != null)
{
buildTmpHTML.append(line);
}
br.close();
}
catch (Exception e)
{
e.printStackTrace();
System.exit(1);
}
}
// Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
// It has its small problems, but it does the trick.
public static void modifyHTML()
{
String htmld = buildTmpHTML.toString();
Document doc = Jsoup.parse(htmld);
final List<TextNode> nodesToChange = new ArrayList<TextNode>();
NodeTraversor nd = new NodeTraversor(new NodeVisitor()
{
@Override
public void tail(Node node, int depth)
{
if (node instanceof TextNode)
{
TextNode textNode = (TextNode) node;
nodesToChange.add(textNode);
}
}
@Override
public void head(Node node, int depth)
{
}
});
nd.traverse(doc.body());
for (TextNode textNode : nodesToChange)
{
Node newNode = buildElementForText(textNode);
textNode.replaceWith(newNode);
}
buildHTML.append(doc.html());
}
private static Node buildElementForText(TextNode textNode)
{
String text = textNode.getWholeText();
String[] words = text.trim().split(" ");
Set<String> units = new HashSet<String>();
for (String word : words)
units.add(word);
String newText = text;
for (String rpl : units)
{
if(rpl.contains("Name"))
newText = newText.replaceAll(rpl, "" + rpl + " " + name);
if(rpl.contains("Address") || rpl.contains("Residence"))
newText = newText.replaceAll(rpl, "" + rpl + " " + address);
if(rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
newText = newText.replaceAll(rpl, "" + rpl + " " + phoneNumber);
}
return new DataNode(newText, textNode.baseUri());
}
}
And you'll get this HTML back (remember I changed "Phone Number" to "Phone-Number"):
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: John Doe </p>
<p>Address: 42 University Dr., Somewhere, Someplace</p>
<p>Phone-Number: (123) 456-7890</p>
</body>
</html>
