JSoup Select Problems - java

I'm having trouble selecting links in my html. Here's the html I have:
<div class=first>
<a href=www.test1.com>test1</a>
<div class=nope>
<a href=www.test2.com>test2</a>
<a href=www.test3.com>test3</a>
<a href=www.test4.com>test4</a>
</div>
</div>
What I want to do is pull the URLs:
www.test2.com
www.test3.com
www.test4.com
I have tried a lot of different .select and .not combinations, but I just can't figure it out. Can anyone point out what I'm doing wrong?
String url = "<div class=first><a href=www.test1.com>test1</a>One<div class=nope><a href=www.test2.com>test2</a>Two</div></div><div class=second><a href=www.test3.com>test3</a></div>";
Document doc = Jsoup.parse(url);
Elements divs = doc.select("div a[href]").not(".first.nope a[href]");
System.out.println(divs);

Document doc = Jsoup.parse(html); // html holds the markup from the question
// No .first() here: it would return a single Element instead of Elements
Elements links = doc.select("div.nope a");
for (Element link : links) {
    System.out.println(link.attr("href"));
}

I would do it a little differently:
Elements elements = doc.select("div.nope").select("a[href]");
for (Element element : elements) {
System.out.println(element.attr("href"));
}

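Elements data = doc.getElementsByClass("nope");
for (Element d : data) {
    // tagName("href") would rename the element; read the href of each link instead
    for (Element link : d.getElementsByTag("a")) {
        String yourData = link.attr("href");
    }
}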
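
A minimal, self-contained sketch of the descendant-selector approach from the answers above, run against the HTML from the question. The class name NopeLinks and the helper hrefsInsideNope are hypothetical names chosen for illustration. It may also help to know that ".first.nope" (no space) matches a single element carrying both classes, which is likely why the .not(...) attempt in the question removed nothing.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class NopeLinks {
    // Returns the href of every <a> that sits inside a <div class="nope">.
    static List<String> hrefsInsideNope(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("div.nope a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<div class=first>"
                + "<a href=www.test1.com>test1</a>"
                + "<div class=nope>"
                + "<a href=www.test2.com>test2</a>"
                + "<a href=www.test3.com>test3</a>"
                + "<a href=www.test4.com>test4</a>"
                + "</div></div>";
        // The link to www.test1.com is outside div.nope, so it is skipped.
        System.out.println(hrefsInsideNope(html));
    }
}
```

Selecting what you want directly ("div.nope a[href]") tends to be easier to reason about than selecting everything and subtracting with .not().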

Related

extracting Text non recursively with Jsoup

This is the code I'm trying to run:
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>";
Document doc = Jsoup.parse(html); // parse the HTML string
Element element = doc.getAllElements().first(); // receive the first element
System.out.println(element.text()); //prints "ZOLA (1)"
System.out.println(element.ownText()); // prints nothing
My goal is to extract only "ZOLA", without the text of the child nodes, but ownText prints nothing.
How should I do it?
The problem is that doc.getAllElements().first() returns the whole document:
<html>
<head></head>
<body>
<a>ZOLA <span class="tiny">(1)</span></a>
</body>
</html>
while you expect only the link element:
<a>ZOLA <span class="tiny">(1)</span></a>
The following should work for you:
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>";
Document doc = Jsoup.parse(html);
Elements links = doc.getElementsByTag("a");
System.out.println(links.get(0));
System.out.println(links.get(0).ownText());
Output:
<a>ZOLA <span class="tiny">(1)</span></a>
ZOLA
You can use this:
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>";
Document doc = Jsoup.parse(html);
Element elementA = doc.selectFirst("a");
System.out.println(elementA.ownText()); // ZOLA
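
A small sketch of the same idea, wrapped in a helper so it can be tried on different inputs. The names OwnTextDemo and directText are hypothetical. It assumes the text either sits directly in the body (a bare fragment) or is wrapped in an a element, as the answers above imply.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Returns only the text that belongs directly to the first element
    // matching cssQuery, ignoring text inside its child elements.
    static String directText(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.ownText().trim();
    }

    public static void main(String[] args) {
        String fragment = "ZOLA <span class=\"tiny\">(1)</span>";
        // A bare fragment is placed inside <body> by the parser.
        System.out.println(directText(fragment, "body"));
        // The same text wrapped in a link element.
        System.out.println(directText("<a>" + fragment + "</a>", "a"));
    }
}
```

The key point is to call ownText() on the element that directly contains the text, not on the document root.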

How to select 1 span from 2 in a div using jsoup

html code
<div title class="Example">
<span>first div</span> <!---->
<span class="second div">second span</span></div>
Java code
Document doc = Jsoup.connect("https://example.com").get(); // jsoup needs an absolute URL, including the protocol
Elements elemenx = doc.select("div.Example span");
for (Element e: elemenx) {
System.out.println(e.text());
}
How can I get only the first span?
I found a solution: add nth-child to the selector.
Elements elemenx = doc.select("div.Example span:nth-child(1)");
Maybe this will be useful to someone.
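
One caveat worth sketching: span:nth-child(1) only matches a span that is the first child element of its parent, so it finds nothing if another element comes before the span. An alternative, shown here as a hedged sketch with hypothetical names (FirstSpan, firstSpanText), is to take the first match of a child selector with selectFirst.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class FirstSpan {
    // Returns the text of the first <span> that is a direct child of
    // <div class="Example">, or null if there is no such span.
    static String firstSpanText(String html) {
        Element span = Jsoup.parse(html).selectFirst("div.Example > span");
        return span == null ? null : span.text();
    }

    public static void main(String[] args) {
        String html = "<div title class=\"Example\">"
                + "<span>first div</span> "
                + "<span class=\"second div\">second span</span></div>";
        System.out.println(firstSpanText(html));
    }
}
```

Because selectFirst returns the first match in document order, this keeps working even if some other element is inserted before the first span.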

Regex to filter span tag if it having attribute

I have the code below, where I want to strip span tags that have no attributes, using Java.
This regex removes all span tags: <(/)?[ ]*span[^>]*>
e.g.
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p><span> </span></p><p><span>Table</span></p>
output:
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p> </p><p>Table</p>
Any help?
It's not possible. A regular expression can't know which closing </span> tag belongs to the <span> you want to remove. Use an HTML parser such as jsoup.
Edit:
Example
String html = "<span style=\"font-weight: bold;text-decoration-line: underline;\">test</span><p><span> </span></p><p><span>Table</span></p>";
Document doc = Jsoup.parse(html);
for (Element span : doc.getElementsByTag("span")) {
if (span.attributes().size() == 0) {
span.unwrap();
}
}
doc.outputSettings().prettyPrint(false);
String result = doc.body().html();
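
To illustrate why a regex alone struggles here, the following sketch uses only the JDK. The class and method names are hypothetical. Removing attribute-less opening tags is easy, but every closing tag looks the same, so the regex either leaves orphaned closing tags behind or also removes the one that belongs to the styled span.

```java
public class SpanRegexDemo {
    // Removes only opening <span> tags that carry no attributes.
    static String stripBareOpeningSpans(String html) {
        return html.replaceAll("<span\\s*>", "");
    }

    // Removes every closing </span> tag, whichever span it belongs to.
    static String stripAllClosingSpans(String html) {
        return html.replaceAll("</span\\s*>", "");
    }

    public static void main(String[] args) {
        String html = "<span style=\"font-weight: bold;\">test</span>"
                + "<p><span> </span></p><p><span>Table</span></p>";
        // Orphaned </span> tags remain inside the paragraphs.
        System.out.println(stripBareOpeningSpans(html));
        // The styled span loses its closing tag as well.
        System.out.println(stripAllClosingSpans(stripBareOpeningSpans(html)));
    }
}
```

Neither output is the desired one, which is the reason a parser that tracks which closing tag pairs with which opening tag is the safer tool.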
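Try this in Java code. Note that it strips every closing </span> tag, including the one that belongs to the styled span:
String str = "your string here";
str = str.replaceAll("<\\/span[^>]*>", "");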

How to read image "alt" attributes within links using jsoup?

I need to read alt attributes with the jsoup library.
For Example :
<a href="www.test.com">
<img src="http://test.org/images/icon/socialNetwork/telegram-icon.png" border="0" alt="telegram"/>
</a>
How can I read it?
Here is a code snippet, which reads all the alt attributes of the image tags:
String html = "<a href=\"www.test.com\"> <img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/> </a>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("img");
for (Element e : elements) {
String alt = e.attr("alt");
System.out.println("alt: " + alt);
}
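
Since the question asks about images within links, here is a hedged variation that restricts the match to img elements nested inside an a element and skips images without an alt attribute. The names LinkImageAlts and altsInsideLinks, and the extra logo.png image in the sample, are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LinkImageAlts {
    // Collects the alt text of every <img alt=...> nested inside a link.
    static List<String> altsInsideLinks(String html) {
        List<String> alts = new ArrayList<>();
        for (Element img : Jsoup.parse(html).select("a[href] img[alt]")) {
            alts.add(img.attr("alt"));
        }
        return alts;
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.test.com\">"
                + "<img src=\"telegram-icon.png\" border=\"0\" alt=\"telegram\"/></a>"
                + "<img src=\"logo.png\" alt=\"logo\"/>"; // not inside a link
        System.out.println(altsInsideLinks(html));
    }
}
```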

Jsoup: select(div[class=rslt prod]) returns null when it shouldn't

I am trying to select all divs with class="rslt prod" from this page: http://www.amazon.fr/s/field-keywords=samsung
Document doc = Jsoup.connect("http://www.amazon.fr/s/field-keywords=samsung").get();
Elements divProd = doc.select("div[class=rslt prod]");
System.out.println("\nsize: "+divProd.size());
But it returns 0 and it shouldn't, any idea why ?
example of what should be selected:
<div id="result_4" class="rslt prod" name="B006O9QNHU">
[...]
</div>
You have to change the user agent; otherwise you get a different page from Amazon.
Document doc = Jsoup.connect("http://www.amazon.fr/s/field-keywords=samsung")
.userAgent("Mozilla/17.0") // you can use any other user agent here
.get();
for( Element element : doc.select("div[class=rslt prod]") )
{
System.out.println(element);
System.out.println("");
}
Now the output is a list like
<div id="result_1" class="rslt prod" name="B007XOM6SU">
...
</div>
<div id="result_2" class="rslt prod" name="B006SXSF4Q">
...
</div>
...
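
A side note on the selector itself, sketched on static HTML so it runs without a network connection. As far as I know, div[class=rslt prod] compares the whole attribute value, so it can miss elements whose classes appear in another order or alongside extra classes, whereas the class selector div.rslt.prod matches any div carrying both classes. The sample markup, the "featured" class, and the names ResultDivs and idsOfResultDivs are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ResultDivs {
    // Returns the id of every <div> that has both the rslt and prod classes.
    static List<String> idsOfResultDivs(String html) {
        List<String> ids = new ArrayList<>();
        for (Element div : Jsoup.parse(html).select("div.rslt.prod")) {
            ids.add(div.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        String html = "<div id=\"result_1\" class=\"rslt prod\"></div>"
                + "<div id=\"result_2\" class=\"prod rslt featured\"></div>"
                + "<div id=\"result_3\" class=\"rslt\"></div>";
        // result_3 lacks the prod class and is skipped.
        System.out.println(idsOfResultDivs(html));
    }
}
```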
