JSoup search by attribute and class - java

You can do:
Elements links = doc.select("a[href]");
to find all "a" elements with an href attribute.
And you can do:
doc.getElementsByClass("title")
to get all elements with a class that is called "title"
But how can I do both? (I.e., search for an "a" element with an "href" attribute that also has the class "title".)

You can simply use:
Elements links = doc.select("a[href].title");
This selects every <a> element that has an href attribute and the class title. A class is specified by prefixing its name with a dot. The jsoup selector documentation lists this under:
Selector combinations
Any combination, e.g. a[href].highlight
Full example:
public static void main(String[] args) {
Document doc = Jsoup.parse(""
+ "<div>"
+ " <a href='link1' class='title another'>Link 1</a>"
+ " <a href='link2' class='another'>Link 2</a>"
+ " <a href='link3'>Link 3</a>"
+ "</div>");
Elements links = doc.select("a[href].title");
System.out.println(links.text()); // prints "Link 1"
}

Related

How to get anchor tag href and anchor tag text inside a div using Selenium in Java

My HTML code consists of multiple divs. Inside each div is a list of anchor tags. I need to fetch the href values and text values of the anchor tags that are in the sub-container div. I'm using Selenium to get the HTML code of the webpage.
HTML code:
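<body>
<div id="main-container">
<a href="www.one.com">One</a>
<a href="www.two.com">Two</a>
<a href="www.three.com">Three</a>
<div id="sub-container">
<a href="www.abc.com">Abc</a>
<a href="www.xyz.com">Xyz</a>
<a href="www.pqr.com">Pqr</a>
</div>
</div>
</body>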
Java code:
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement element : list) {
String link = element.getAttribute("href");
System.out.println(element.getTagName() + "=" + link);
}
Output:
a=www.one.com
a=www.two.com
a=www.three.com
a=www.abc.com
a=www.xyz.com
a=www.pqr.com
Output I need:
a=www.abc.com , Abc
a=www.xyz.com , Xyz
a=www.pqr.com , Pqr
Try this,
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/*[@href]"));
for (WebElement element : list) {
String link = element.getAttribute("href");
System.out.println(element.getTagName() + "=" + link +", "+ element.getText());
}
You can use element.getText() to get the link text.
If you only want to select the links in the sub-container, you can adjust your XPath:
//*[@id="sub-container"]/a
Pretty simple, try as below:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/a"));
for (WebElement element : list) {
String link = element.getAttribute("href");
String text = element.getText();
System.out.println(element.getTagName() + "=" + link + ", " + text);
}
If the id sub-container is unique, you can use a CSS selector instead:
driver.findElements(By.cssSelector("div#sub-container>a"));
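The XPath expressions from the answers above can be tried without a browser. The following is a minimal sketch that swaps in the JDK's built-in javax.xml.xpath API for Selenium, so it only illustrates what the expressions match; it is not a Selenium example. The class name SubContainerLinks, the helper select, and the shortened sample markup are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SubContainerLinks {
    // Hypothetical sample markup, written as well-formed XML so the JDK parser accepts it.
    static final String PAGE =
        "<body><div id='main-container'>"
        + "<a href='www.one.com'>One</a>"
        + "<div id='sub-container'>"
        + "<a href='www.abc.com'>Abc</a>"
        + "<a href='www.xyz.com'>Xyz</a>"
        + "</div></div></body>";

    // Evaluates an XPath that selects elements and formats each as "tag=href , text".
    static List<String> select(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element a = (Element) nodes.item(i); // assumes the XPath selects elements only
                out.add(a.getTagName() + "=" + a.getAttribute("href") + " , " + a.getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only the anchors that are direct children of the sub-container div.
        select("//div[@id='sub-container']/a").forEach(System.out::println);
    }
}
```

Passing "//*[@href]" to select instead returns all three anchors in the sample, which mirrors the unwanted output in the question.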

How to find the html element of a given text

Assume I have the following code to be parsed using JSoup
<body>
<div id="myDiv" class="simple" >
<p>
<img class="alignleft" src="myimage.jpg" alt="myimage" />
I just passed out of UC Berkeley
</p>
</div>
</body>
The question is: given just a keyword such as "Berkeley", is there a better way to find the element or XPath (or a list of them, if the keyword occurs more than once) whose text contains this keyword?
I don't get to see the HTML beforehand; it is available only at runtime.
My current implementation uses Java and Jsoup: iterate through the children of body, get the "ownText" and text of each child, and then drill down into their children to narrow down the HTML element. I feel this is very slow.
A simple, if not elegant, way could look like this:
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String argv[]) {
String html = "<body> \n" +
" <div id=\"myDiv\" class=\"simple\" >\n" +
" <p>\n" +
" <img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />\n" +
" I just passed out of UC Berkeley\n" +
" </p>\n" +
" <ol>\n" +
" <li>Berkeley</li>\n" +
" <li>Berkeley</li>\n" +
" </ol>\n" +
" </div> \n" +
"</body>";
Element doc = Jsoup.parse(html); // parse once and reuse (a Document is an Element)
Elements eles = doc.getAllElements(); // get all elements which appear in your html
Set<String> set = new HashSet<>();
for(Element e : eles){
Tag t = e.tag();
set.add(t.getName()); // put the tag name in a set or list
}
set.remove("head"); set.remove("html"); set.remove("body"); set.remove("#root"); set.remove("img"); //remove some unimportant tags
for(String s : set){
System.out.println(s);
Elements matches = doc.select(s+":contains(Berkeley)"); // elements of this tag whose text contains your key word
if(!matches.isEmpty()){
System.out.println(matches.get(0).toString()); // print it out or do something else
}
System.out.println("---------------------");
System.out.println();
}
}
}
Try this XPath.
For the first element with a class:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@class]'
For the first element with an id:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@id]'
See the XPath normalize-space() function.
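To see what a contains(normalize-space(), ...) test matches, here is a minimal sketch using the JDK's javax.xml.xpath API on a well-formed version of the sample markup. The class name KeywordXPath, the helper matchingTags, and the second expression (which tests an element's own text nodes) are assumptions added for illustration; they are not part of the answer above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class KeywordXPath {
    // Hypothetical sample, based on the question's markup but written as well-formed XML.
    static final String PAGE =
        "<body><div id='myDiv' class='simple'>"
        + "<p><img class='alignleft' src='myimage.jpg' alt='myimage'/> I just passed out of UC Berkeley</p>"
        + "<ol><li>Berkeley</li><li>Stanford</li></ol>"
        + "</div></body>";

    // Returns the tag names of all nodes selected by the given XPath expression.
    static List<String> matchingTags(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An element's string value includes descendant text, so ancestors match too.
        System.out.println(matchingTags("//*[contains(normalize-space(), 'Berkeley')]"));
        // Testing the element's own text nodes keeps only the innermost elements.
        System.out.println(matchingTags("//*[text()[contains(., 'Berkeley')]]"));
    }
}
```

With this sample, the first expression matches body, div, p, ol and the first li, while the second matches only p and the first li, which is usually closer to what a keyword search wants.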

Jsoup not selector not returning result

I'm trying to use a Jsoup selector to select everything in a div with class 'content' while not selecting any divs with class social or media. I know I can do a simple select and loop, but I would have expected the :not function to work for this purpose. Perhaps my selector syntax is wrong.
public static void main(String args[]) {
String html = "<html>\n" +
"<body>\n" +
"<div class=\"content\">\n" +
"\t<p>some paragraph</p>\n" +
"\t<div class=\"social media\">\n" +
"\tfind us on facebook\n" +
"\t</div>\n" +
"</div>\n" +
"</body>\n" +
"</html>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("div.content div:not(.social)");
System.out.println(elements.text());
}
Expected result: "some paragraph"
Actual result: null
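As written, your selector matches div elements that do not have the class social and are descendants of the div with class content. The only nested div has that class, so nothing matches. To get the expected outcome, use this:
Elements elements = doc.select("div.content :not(.social)");
Or filter the children of that div with Elements.not(), which removes matching elements from the selected list itself:
Elements elements = doc.select("div.content > *").not(".social");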

How to use JSoup to get hyperlink href?

I have the following jsFiddle
http://jsfiddle.net/B5zvV/
I am trying to use JSoup to obtain the value of the hyperlink's href string on Line 238:
<a href="/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450">
Hence, the desired result would be to obtain a String with a value of:
/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450
Here's my code:
Document doc = Jsoup.connect("http://myapp.example.com/fizz.html").get();
Elements elems = doc.getElementsByAttributeValueContaining("href", "repositoryId");
When I run this, the value of elems is empty: why, and what do I need to do to get the desired String?
The getElementsByAttributeValueContaining() method will return multiple elements in this case, because many href values contain repositoryId. If you specifically want the one on line 238, that a element is enclosed in an li with the classes item and item-default. There is only one such li, with two a tags inside it. Just take the first one, like this:
String html = "<li class=\"item item-default\" data-item-id=\"28049450\" id=\"item-28049450\">"
+ "<a href=\"/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450\">"
+ "<h3 class=\"item-title\">MCAppRepo <span class=\"item-default-marker grey\">(default)</span></h3>"
+ "</a>"
+ "<a href=\"/chain/admin/config/confirmDeleteRepository.action?planKey=AB-CSD&repositoryId=28049450\" class=\"delete\" title=\"Remove repository\">"
+ "<span class=\"assistive\">Delete</span>"
+ "</a>"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements elems = doc.select("li.item.item-default > a");
System.out.println(elems.first().attr("href"));

Jsoup - extracting text

I need to extract text from a node like this:
<div>
Some text <b>with tags</b> might go here.
<p>Also there are paragraphs</p>
More text can go without paragraphs<br/>
</div>
And I need to build:
Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs
Element.text returns all the content of the div as one string. Element.ownText returns everything that is not inside child elements. Both are wrong. Iterating through children ignores text nodes.
Is there a way to iterate over the contents of an element and receive text nodes as well? E.g.
Text node - Some text
Node <b> - with tags
Text node - might go here.
Node <p> - Also there are paragraphs
Text node - More text can go without paragraphs
Node <br> - <empty>
Element.children() returns an Elements object - a list of Element objects. Looking at the parent class, Node, you'll see methods to give you access to arbitrary nodes, not just Elements, such as Node.childNodes().
public static void main(String[] args) throws IOException {
String str = "<div>" +
" Some text <b>with tags</b> might go here." +
" <p>Also there are paragraphs</p>" +
" More text can go without paragraphs<br/>" +
"</div>";
Document doc = Jsoup.parse(str);
Element div = doc.select("div").first();
int i = 0;
for (Node node : div.childNodes()) {
i++;
System.out.println(String.format("%d %s %s",
i,
node.getClass().getSimpleName(),
node.toString()));
}
}
Result:
1 TextNode
Some text
2 Element <b>with tags</b>
3 TextNode might go here.
4 Element <p>Also there are paragraphs</p>
5 TextNode More text can go without paragraphs
6 Element <br/>
for (Element el : doc.select("body").select("*")) {
for (TextNode node : el.textNodes()) {
System.out.println(node.text());
}
}
Assuming you want text only (no tags) my solution is below.
Output is:
Some text with tags might go here. Also there are paragraphs. More text can go without paragraphs
public static void main(String[] args) throws IOException {
String str =
"<div>"
+ " Some text <b>with tags</b> might go here."
+ " <p>Also there are paragraphs.</p>"
+ " More text can go without paragraphs<br/>"
+ "</div>";
Document doc = Jsoup.parse(str);
Element div = doc.select("div").first();
StringBuilder builder = new StringBuilder();
stripTags(builder, div.childNodes());
System.out.println("Text without tags: " + builder.toString());
}
/**
* Strip tags from a List of type <code>Node</code>
* @param builder StringBuilder : input and output
* @param nodesList List of type <code>Node</code>
*/
public static void stripTags (StringBuilder builder, List<Node> nodesList) {
for (Node node : nodesList) {
String nodeName = node.nodeName();
if (nodeName.equalsIgnoreCase("#text")) {
builder.append(node.toString());
} else {
// recurse
stripTags(builder, node.childNodes());
}
}
}
You can use TextNode for this purpose:
List<TextNode> bodyTextNode = doc.getElementById("content").textNodes();
String html = "";
for(TextNode txNode:bodyTextNode){
html+=txNode.text();
}
