How to read image "alt" attributes within links using jsoup? - java

I need to read alt attributes with the jsoup library.
For example:
<a href="www.test.com">
<img src="http://test.org/images/icon/socialNetwork/telegram-icon.png" border="0" alt="telegram"/>
</a>
How can I read it?

Here is a code snippet that reads the alt attribute of every image tag:
String html = "<a href=\"www.test.com\"> <img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/> </a>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("img");
for (Element e : elements) {
String alt = e.attr("alt");
System.out.println("alt: " + alt);
}
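If only the images that sit inside links matter, a CSS selector can narrow the match. The following is a minimal sketch, assuming a jsoup version that supports select(); the class name, the method name and the extra logo.png image are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkedImageAlts {
    // Returns the alt text of every <img> nested inside an <a href> element.
    static List<String> altsInsideLinks(String html) {
        Document document = Jsoup.parse(html);
        List<String> alts = new ArrayList<>();
        // "a[href] img[alt]" = any img with an alt attribute that is a descendant of a link
        for (Element img : document.select("a[href] img[alt]")) {
            alts.add(img.attr("alt"));
        }
        return alts;
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.test.com\">"
                + "<img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/>"
                + "</a>"
                // hypothetical image outside any link, to show that it is skipped
                + "<img src=\"logo.png\" alt=\"not in a link\"/>";
        System.out.println(altsInsideLinks(html)); // [telegram]
    }
}
```

Compared with getElementsByTag("img"), the selector skips images that are not wrapped in a link.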

Related

extracting Text non recursively with Jsoup

This is the code I'm trying to run:
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html); // parse the HTML string
Element element = doc.getAllElements().first(); // take the first element
System.out.println(element.text()); // prints "ZOLA (1)"
System.out.println(element.ownText()); // prints nothing
My goal is to extract only "ZOLA", without the text of the child nodes, but ownText() prints nothing.
How should I do it?
The problem is that doc.getAllElements().first() returns the root of the whole document, which has no text of its own:
<html>
<head></head>
<body>
ZOLA <span class="tiny">(1)</span>
</body>
</html>
while you expect
ZOLA <span class="tiny">(1)</span>
The following should work for you:
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>"; // assumes the text sits inside a link; the tag's attributes are not shown in the question
Document doc = Jsoup.parse(html);
Elements links = doc.getElementsByTag("a");
System.out.println(links.get(0));
System.out.println(links.get(0).ownText());
Output:
<a>ZOLA <span class="tiny">(1)</span></a>
ZOLA
You can use this:
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>"; // assumes the text sits inside a link; the tag's attributes are not shown in the question
Document doc = Jsoup.parse(html);
Element elementA = doc.selectFirst("a");
System.out.println(elementA.ownText()); // ZOLA
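For the string exactly as posted in the question, which has no enclosing element, jsoup places the content inside body, so asking body for its own text is one way to get the same result. A minimal sketch; the class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class OwnTextOfBody {
    // Text that belongs directly to <body>, ignoring text inside child elements.
    static String bodyOwnText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().ownText();
    }

    public static void main(String[] args) {
        String html = "ZOLA <span class=\"tiny\">(1)</span>";
        System.out.println(bodyOwnText(html)); // ZOLA
    }
}
```

This only helps when the wanted text is a direct child of body; on a real page it is usually better to select the specific parent element first and call ownText() on that.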

Regex to filter span tag if it having attribute

I have the HTML below and want to strip every span tag that has no attributes, using Java.
This regex removes all span tags: <(/)?[ ]*span[^>]*>
Input:
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p><span> </span></p><p><span>Table</span></p>
output:
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p> </p><p>Table</p>
Any help?
It's not possible with a single regular expression. A regular expression can't know which closing </span> tag belongs to the <span> you want to remove. Use an HTML parser such as jsoup.
Edit:
Example
String html = "<span style=\"font-weight: bold;text-decoration-line: underline;\">test</span><p><span> </span></p><p><span>Table</span></p>";
Document doc = Jsoup.parse(html);
for (Element span : doc.getElementsByTag("span")) {
if (span.attributes().size() == 0) {
span.unwrap();
}
}
doc.outputSettings().prettyPrint(false);
String result = doc.body().html();
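You could try a plain replaceAll in Java, but note that it strips every closing </span> tag, including the one that belongs to the styled span, and leaves the opening tags in place:
String str = html; // html holds your input string
str = str.replaceAll("<\\/span[^>]*>", "");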
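If adding a parser is not an option, one possible middle ground is to use a regex only to locate the span tags and a stack to pair each closing tag with its opener. This is a rough sketch using only the JDK. The class and method names are made up for illustration, and it assumes that span tags are properly nested and that attribute values never contain a > character. A real parser such as jsoup remains the more robust choice.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BareSpanStripper {
    // Matches an opening <span ...> (group 1 = attribute text) or a closing </span>.
    private static final Pattern SPAN_TAG =
            Pattern.compile("<span\\b([^>]*)>|</span\\s*>", Pattern.CASE_INSENSITIVE);

    // Removes <span> tags that carry no attributes, together with their matching </span>.
    static String unwrapBareSpans(String html) {
        StringBuilder out = new StringBuilder();
        Deque<Boolean> keep = new ArrayDeque<>(); // one entry per currently open span
        Matcher m = SPAN_TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy everything between span tags
            last = m.end();
            if (m.group(1) != null) {
                // opening tag: keep it only if it has attributes
                boolean hasAttributes = !m.group(1).trim().isEmpty();
                keep.push(hasAttributes);
                if (hasAttributes) {
                    out.append(m.group());
                }
            } else {
                // closing tag: keep it only if its opener was kept (or is missing)
                boolean keptOpener = keep.isEmpty() || keep.pop();
                if (keptOpener) {
                    out.append(m.group());
                }
            }
        }
        out.append(html.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<span style=\"font-weight: bold;text-decoration-line: underline;\">test</span>"
                + "<p><span> </span></p><p><span>Table</span></p>";
        System.out.println(unwrapBareSpans(html));
    }
}
```

The stack is what the single regex lacks: it remembers, for each open span, whether the opening tag was kept, so the right closing tag is dropped even when spans are nested.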

Html parse with Jsoup

<div>
<div class = "main">
<div class ="content">
<div class="content_left">
<div class="alisveris_context_box">
<ul class = "sinema_list">
<li>
<a href="blabla/12" title="asd">
<img src="http://asd.jpg">
<span class ="cartoon">
Textaa
</span>
How can I get the href value (blabla/12 in the example) and the span value (Textaa in the example)?
Let's say your HTML is the following:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://example.com/"
link.attr("href") will have your link.
Same for your span. Think for yourself :)
source: http://jsoup.org/cookbook/extracting-data/attributes-text-html
Elements elements = Jsoup.parse(html).select("div[class=main] div[class=content] div[class=content_left] div[class=alisveris_context_box] ul[class=sinema_list] li a");
String href = elements.first().attr("href");
String spanText = elements.first().select("span[class=cartoon]").first().text();
Using jsoup this is easy.
You can get the span value like this:
String st="<div> <div class = \"main\"> <div class =\"content\"> "
+ "<div class=\"content_left\"> <div class=\"alisveris_context_box\">"
+ " <ul class = \"sinema_list\"> <li> <a href=\"blabla/12\" title=\"asd\">"
+ "<img src=\"http://asd.jpg\"> <span class =\"cartoon\"> Textaa </span>";
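String spanValue=Jsoup.parse(st).select("span.cartoon").text(); // text() on the whole document would also pick up any other text on the page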
and href value by
String href=Jsoup.parse(st).getElementsByTag("a").attr("href");

JSoup Select Problems

I'm having trouble selecting links in my html. Here's the html I have:
<div class=first>
<a href=www.test1.com>test1</a>
<div class=nope>
<a href=www.test2.com>test2</a>
<a href=www.test3.com>test3</a>
<a href=www.test4.com>test4</a>
</div>
</div>
What I want to do is pull the URLs:
www.test2.com
www.test3.com
www.test4.com
I have tried a lot of different .select and .not combinations but I just can't figure it out. Can anyone point out what I'm doing wrong?
Select every link inside a div, then drop the ones that are direct children of div.first:
String html = "<div class=first><a href=www.test1.com>test1</a>One<div class=nope><a href=www.test2.com>test2</a>Two</div></div><div class=second><a href=www.test3.com>test3</a></div>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("div a[href]").not("div.first > a"); // ".first.nope" would only match one element carrying both classes
System.out.println(links);
Document doc = Jsoup.parse(html); // html holds the markup from the question
Elements links = doc.select("div.nope a"); // no .first() here: that would return a single Element
for (Element link : links) {
System.out.println(link.attr("href"));
}
I would do it a little differently:
Elements elements = doc.select("div.nope").select("a[href]");
for (Element element : elements) {
System.out.println(element.attr("href"));
}
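Elements data = doc.getElementsByClass("nope");
for (Element d : data) {
for (Element a : d.getElementsByTag("a")) {
String yourData = a.attr("href"); // tagName("href") would rename the element instead of reading the attribute
}
}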
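When only the attribute values are needed, the loop can be left out. This is a minimal sketch that assumes a jsoup release recent enough to provide Elements.eachAttr(); the class and method names are made up for illustration.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NestedLinkHrefs {
    // Collects the href of every link that is a direct child of div.nope.
    static List<String> hrefsInNope(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.nope > a[href]").eachAttr("href");
    }

    public static void main(String[] args) {
        String html = "<div class=first>"
                + "<a href=www.test1.com>test1</a>"
                + "<div class=nope>"
                + "<a href=www.test2.com>test2</a>"
                + "<a href=www.test3.com>test3</a>"
                + "<a href=www.test4.com>test4</a>"
                + "</div></div>";
        System.out.println(hrefsInNope(html)); // [www.test2.com, www.test3.com, www.test4.com]
    }
}
```

The child combinator (>) keeps the match to links directly under div.nope, so the test1 link in the outer div is never selected.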

Jsoup image tag extraction

I need to extract an image tag from this HTML using jsoup:
<div class="picture">
<img src="http://asdasd/aacb.jpgs" title="picture" alt="picture" />
</div>
I need to extract the src of this img tag.
I am using this code, but I am getting a null value:
Element masthead2 = doc.select("div.picture").first();
String linkText = masthead2.outerHtml();
Document doc1 = Jsoup.parse(linkText);
Element masthead3 = doc1.select("img[src]").first();
String linkText1 = masthead3.html();
Here's an example to get the image source attribute:
public static void main(String... args) {
Document doc = Jsoup.parse("<div class=\"picture\"><img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /></div>");
Element img = doc.select("div.picture img").first();
String imgSrc = img.attr("src");
System.out.println("Img source: " + imgSrc);
}
The div.picture img selector finds the image element under the div.
The main extract methods on an element are:
attr(name), which gets the value of an element's attribute,
text(), which gets the text content of an element (e.g. in <p>Hello</p>, text() is "Hello"),
html(), which gets an element's inner HTML (<div><img></div> html() = <img>), and
outerHtml(), which gets an element's full HTML (<div><img></div> outerHtml() = <div><img></div>)
You don't need to reparse the HTML as in your current example. Either select the correct element in the first place using a more specific selector, or call element.select(string) on an element you already have to narrow the match down.
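To make the difference between the four extract methods concrete, here is a small sketch. The paragraph markup and the class and method names are made up for illustration, and pretty-printing is switched off so that html() and outerHtml() return the markup without added whitespace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ExtractMethods {
    // Parses the HTML and returns the first element matching the CSS query.
    static Element first(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false); // keep html()/outerHtml() free of added whitespace
        return doc.select(cssQuery).first();
    }

    public static void main(String[] args) {
        Element p = first("<p class=\"note\">Hello <b>world</b></p>", "p");
        System.out.println(p.attr("class")); // note
        System.out.println(p.text());        // Hello world
        System.out.println(p.html());        // Hello <b>world</b>
        System.out.println(p.outerHtml());   // <p class="note">Hello <b>world</b></p>
    }
}
```

Calling html() on an img element returns an empty string because img has no children, which is why attr("src") is the right call for the image in the question.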
<tr> <td class="blackNoLine" nowrap="nowrap" valign="top" width="25" align="left"><b>CAST: </b></td> <td class="blackNoLine" valign="top" width="416">Jay, Shazahn Padamsee </td> </tr>
You can use:
Document doc = Jsoup.parse(...);
Elements els = doc.select("td[class=blackNoLine]");
Element el= els.get(1);
String castName = el.text();
With the following code I can extract the image correctly:
Document doc = Jsoup.parse("<div class=\"picture\"> <img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /> </div>");
Element elem = doc.select("div.picture img").first();
System.out.println("elem: " + elem.attr("src"));
I'm using jsoup release 1.2.2, the latest one.
Maybe you're trying to print the inner HTML of an empty tag such as img.
From the documentation: "html() - Retrieves the element's inner HTML".
For the second portion of HTML you can use:
Document doc2 = Jsoup.parse("<tr> <td class=\"blackNoLine\" nowrap=\"nowrap\" valign=\"top\" width=\"25\" align=\"left\"><b>CAST: </b></td> <td class=\"blackNoLine\" valign=\"top\" width=\"416\">Jay, Shazahn Padamsee </td> </tr>");
Elements trElems = doc2.select("tr");
if (trElems != null) {
for (Element element : trElems) {
Element secondTd = element.select("td").get(1);
System.out.println("name: " + secondTd.text());
}
}
which prints "Jay, Shazahn Padamsee".
