How to use JSoup to get hyperlink href? - java

I have the following jsFiddle
http://jsfiddle.net/B5zvV/
I am trying to use JSoup to obtain the value of the hyperlink's href string on Line 238:
<a href="/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450">
Hence, the desired result would be to obtain a String with a value of:
/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450
Here's my code:
Document doc = Jsoup.connect("http://myapp.example.com/fizz.html").get();
Elements elems = doc.getElementsByAttributeValueContaining("href", "repositoryId");
When I run this, the value of elems is empty: why, and what do I need to do to get the desired String?

The getElementsByAttributeValueContaining() method returns multiple elements in this case because many hrefs contain repositoryId. If you specifically want the link on line 238, that a tag is enclosed in an li with the classes item and item-default. There is only one such li, and it contains two a tags. Just take the first one, like this:
String html = "<li class=\"item item-default\" data-item-id=\"28049450\" id=\"item-28049450\">"
+ "<a href=\"/chain/admin/config/editRepository.action?planKey=AB-CSD&repositoryId=28049450\">"
+ "<h3 class=\"item-title\">MCAppRepo <span class=\"item-default-marker grey\">(default)</span></h3>"
+ "</a>"
+ "<a href=\"/chain/admin/config/confirmDeleteRepository.action?planKey=AB-CSD&repositoryId=28049450\" class=\"delete\" title=\"Remove repository\">"
+ "<span class=\"assistive\">Delete</span>"
+ "</a>"
+ "</li>";
Document doc = Jsoup.parse(html);
Elements elems = doc.select("li.item.item-default > a");
System.out.println(elems.first().attr("href"));

Related

Parsing html in Jsoup

I am trying to parse HTML tags using jsoup, and I am new to it. Basically I need to parse the tags, get the text inside them, and apply the style named in the class attribute.
I am creating a SpannableStringBuilder so that I can create substrings, apply styles, and append them together with the text that has no style.
String value = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
SpannableStringBuilder text = new SpannableStringBuilder();
if (value.contains("</span>")) {
Document document = Jsoup.parse(value);
Elements elements = document.getElementsByTag("span");
if (elements != null) {
int i = 0;
int start = 0;
for (Element ele : elements) {
String styleName = type + "." + ele.attr("class");
text.append(ele.text());
int style = context.getResources().getIdentifier(styleName, "style", context.getPackageName());
text.setSpan(new TextAppearanceSpan(context, style), start, text.length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
text.append(ele.nextSibling().toString());
start = text.length();
i++;
}
}
return text;
}
I am not sure how I can parse the strings that are not between any tags, such as "There are" and "workers from the".
I need output such as:
- There are
- <span class='newStyle'> two </span>
- workers from the
- <span class='oldStyle'>Front of House</span>
You can get the text outside the tags by calling childNodes(), which returns a List<Node>. Note that I'm selecting body because your HTML fragment doesn't have a parent element, and jsoup adds <html> and <body> automatically when it parses a fragment.
If a Node contains only text, it is a TextNode and you can get its content using toString().
Otherwise you can cast it to Element and get the text using element.text().
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
Document doc = Jsoup.parse(str);
Element body = doc.selectFirst("body");
List<Node> childNodes = body.childNodes();
for (int i = 0; i < childNodes.size(); i++) {
Node node = childNodes.get(i);
if (node instanceof TextNode) {
System.out.println(i + " -> " + node.toString());
} else {
Element element = (Element) node;
System.out.println(i + " -> " + element.text());
}
}
output:
0 ->
There are
1 -> two
2 -> workers from the
3 -> Front of House
By the way: I don't know how to get rid of the first line break before There are.
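The same child-node walk can be tried in isolation without jsoup. What follows is a rough stand-in, not jsoup code: it uses the JDK's own DOM parser, wraps the fragment in a <body> root, and assumes the fragment is well-formed XML (real HTML often is not). The names FragmentSegments, SAMPLE and segments, and the "plain" label, are made up for illustration. Each segment is labelled with the span's class name, and trim() drops the surrounding whitespace.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FragmentSegments {

    // The fragment from the question.
    static final String SAMPLE = "There are <span class='newStyle'> two </span> workers from the "
            + "<span class='oldStyle'>Front of House</span>";

    // Returns one entry per top-level node, in document order, as "label: text".
    static List<String> segments(String fragment) {
        try {
            // A bare fragment has no single root, so wrap it (jsoup does this for you).
            String xml = "<body>" + fragment + "</body>";
            Element body = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();

            List<String> result = new ArrayList<>();
            NodeList children = body.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                String text = node.getTextContent().trim();
                if (node.getNodeType() == Node.ELEMENT_NODE) {
                    // Styled segment: label it with the class attribute.
                    result.add(((Element) node).getAttribute("class") + ": " + text);
                } else if (node.getNodeType() == Node.TEXT_NODE) {
                    // Text that sits between the tags.
                    result.add("plain: " + text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        for (String segment : segments(SAMPLE)) {
            System.out.println(segment);
        }
    }
}
```

In jsoup itself, calling text() on the TextNode instead of toString() should avoid the leading line break, since toString() returns the pretty-printed outer HTML of the node.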

How to get anchor tag href and anchor tag text inside a div using Selenium in Java

My HTML code consists of multiple divs. Inside each div is a list of anchor tags. I need to fetch the href values and text values of the anchor tags that are in the sub-container div. I'm using Selenium to get the HTML code of the webpage.
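HTML code:
<body>
<div id="main-container">
<a href="www.one.com">One</a>
<a href="www.two.com">Two</a>
<a href="www.three.com">Three</a>
<div id="sub-container">
<a href="www.abc.com">Abc</a>
<a href="www.xyz.com">Xyz</a>
<a href="www.pqr.com">Pqr</a>
</div>
</div>
</body>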
Java code:
List<WebElement> list = driver.findElements(By.xpath("//*[@href]"));
for (WebElement element : list) {
String link = element.getAttribute("href");
System.out.println(element.getTagName() + "=" + link);
}
Output:
a=www.one.com
a=www.two.com
a=www.three.com
a=www.abc.com
a=www.xyz.com
a=www.pqr.com
Output I need:
a=www.abc.com , Abc
a=www.xyz.com , Xyz
a=www.pqr.com , Pqr
Try this:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/*[@href]"));
for (WebElement element : list) {
String link = element.getAttribute("href");
System.out.println(element.getTagName() + "=" + link +", "+ element.getText());
}
You can use element.getText() to get the link text.
If you only want to select the links in the sub-container, you can adjust your XPath:
//*[@id="sub-container"]/a
Pretty simple, try as below:
List<WebElement> list = driver.findElements(By.xpath("//div[@id='sub-container']/a"));
for (WebElement element : list) {
String link = element.getAttribute("href");
String text = element.getText();
System.out.println(element.getTagName() + "=" + link + ", " + text);
}
If the id sub-container is unique, you can simply use the line below:
driver.findElements(By.cssSelector("div#sub-container>a"));
thanks
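A quick way to sanity-check the XPath itself, without starting a browser, is to run it through the JDK's javax.xml.xpath package. This is only a sketch under assumptions: the markup is a well-formed reconstruction of the question's page (href values copied from its output), SubContainerLinks, PAGE and linksIn are made-up names, and a real page served to Selenium may not be valid XML. Note that XPath tests attributes with @, as in [@id='sub-container'].

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SubContainerLinks {

    // Assumed, well-formed copy of the page from the question.
    static final String PAGE =
            "<body><div id=\"main-container\">"
            + "<a href=\"www.one.com\">One</a>"
            + "<a href=\"www.two.com\">Two</a>"
            + "<a href=\"www.three.com\">Three</a>"
            + "<div id=\"sub-container\">"
            + "<a href=\"www.abc.com\">Abc</a>"
            + "<a href=\"www.xyz.com\">Xyz</a>"
            + "<a href=\"www.pqr.com\">Pqr</a>"
            + "</div></div></body>";

    // Evaluates the expression and formats each matched element as "tag=href, text".
    static List<String> linksIn(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                result.add(element.getTagName() + "=" + element.getAttribute("href")
                        + ", " + element.getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        for (String line : linksIn(PAGE, "//div[@id='sub-container']/a")) {
            System.out.println(line);
        }
    }
}
```

With this sample markup, //*[@href] matches all six anchors, while //div[@id='sub-container']/a matches only the three inside sub-container.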

JSoup search by attribute and class

You can do:
Elements links = doc.select("a[href]");
to find all "a" elements with an href attribute.
And you can do:
doc.getElementsByClass("title")
to get all elements with a class that is called "title"
But how can I do both? (I.e. search for an "a" element with an "href" attribute that also has the class "title".)
You can simply have
Elements links = doc.select("a[href].title");
This selects every <a> element that has an href attribute and the class title. The class is specified by prefixing it with a dot:
Selector combinations
Any combination, e.g. a[href].highlight
Full example:
public static void main(String[] args) {
Document doc = Jsoup.parse(""
+ "<div>"
+ " <a href='link1' class='title another'>Link 1</a>"
+ " <a href='link2' class='another'>Link 2</a>"
+ " <a href='link3'>Link 3</a>"
+ "</div>");
Elements links = doc.select("a[href].title");
System.out.println(links); // prints only the matching element, the "Link 1" anchor
}

Jsoup not selector not returning result

I am trying to use a jsoup selector to select everything in a div with class 'content', while not selecting any divs with class social or media. I know I could do a simple select and loop, but I expected the :not pseudo-selector to work for this. Perhaps my selector syntax is wrong.
public static void main(String args[]) throws ParseException {
String html = "<html>\n" +
"<body>\n" +
"<div class=\"content\">\n" +
"\t<p>some paragraph</p>\n" +
"\t<div class=\"social media\">\n" +
"\tfind us on facebook\n" +
"\t</div>\n" +
"</div>\n" +
"</body>\n" +
"</html>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("div.content div:not(.social)");
System.out.println(elements.text());
}
Expected result: "some paragraph"
Actual result: null
As written, your selector matches divs that do not have the class social and are descendants of the div with class content. The only nested div does have that class, so nothing matches. To get the expected outcome, use this:
Elements elements = doc.select("div.content :not(.social)");
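Or this, though note that not() filters the selected elements themselves rather than their descendants, so here it still returns the whole div.content element: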
Elements elements = doc.select("div.content").not(".social");

get inner tags from HTML

I'm new to jsoup, so any help would be great. The official tutorial shows that the code below gets the inner tag as a String:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
But what if the HTML string is extremely long and therefore contains a lot of tags? Is there a limitation in this case? How could I make an array or ArrayList of Strings containing all the inner tags of "a"?
Jsoup is similar to a DOM parser. It parses the entire HTML into a tree structure, so the size of document it can handle depends on the Java heap size you have configured.
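To see what that limit currently is, a small check along these lines can help. It is just a sketch using the standard Runtime API. The class name HeapInfo and the method maxHeapMegabytes are made up for illustration.

```java
class HeapInfo {

    // Maximum amount of memory the JVM will attempt to use for the heap, in whole megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

If very large documents run out of memory, the heap can be raised with the JVM's -Xmx option when launching the program, for example -Xmx1g.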
As for getting a tag, there are several ways. The easiest is the document.select() method, as in Masud's answer.
Document document = Jsoup.parse(html);
List<String> tags = new ArrayList<String>();
for(Element e : document.select("a")){
tags.add(e.tagName());
}
System.out.println("The tags = " + tags);
//If you want it as array
String[] tagsArray = tags.toArray(new String[tags.size()]);
For more options, see this answer: How to Count total Html Tags using Jsoup
How could I make an array or arraylist of String containing all inner tags of "a"
You can get an Elements object from doc. Elements is a list (it extends ArrayList<Element>), and here it contains all the <a> tags.
Document doc = Jsoup.parse(html);
Elements allAnchorTags = doc.select("a");
System.out.println(allAnchorTags); // prints the outer HTML of every <a> tag
