How to read image "alt" attributes within links using jsoup?

I need to read the alt attributes of images with the jsoup library.
For example:
<a href="www.test.com">
<img src="http://test.org/images/icon/socialNetwork/telegram-icon.png" border="0" alt="telegram"/>
</a>
How can I read them?

Here is a code snippet that reads the alt attributes of all image tags:
String html = "<a href=\"www.test.com\"> <img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/> </a>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("img");
for (Element e : elements) {
    String alt = e.attr("alt");
    System.out.println("alt: " + alt);
}
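
The snippet above reads every image in the document. If only images nested inside links matter, as in the question, a descendant selector can narrow the match. The following is a minimal sketch, assuming jsoup is on the classpath; the class name AltReader, the method altsInsideLinks, and the second image in the sample HTML are made up for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Illustrative helper (not part of jsoup): collects alt text of images nested in links.
class AltReader {

    static List<String> altsInsideLinks(String html) {
        Document document = Jsoup.parse(html);
        // "a img[alt]" matches <img> elements that carry an alt attribute
        // and sit anywhere inside an <a> element.
        return document.select("a img[alt]").eachAttr("alt");
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.test.com\">"
                + "<img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/>"
                + "</a>"
                // Hypothetical extra image outside any link; it should be skipped.
                + "<img src=\"logo.png\" alt=\"outside any link\"/>";
        System.out.println(altsInsideLinks(html)); // [telegram]
    }
}
```

Use "a > img[alt]" instead if the image must be a direct child of the link.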

Related

extracting Text non recursively with Jsoup

This is the code I'm trying to run:
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html); // parse the HTML string
Element element = doc.getAllElements().first(); // take the first element
System.out.println(element.text()); // prints "ZOLA (1)"
System.out.println(element.ownText()); // prints nothing
My goal is to extract only "ZOLA", without the text of the child nodes, but ownText() prints nothing.
How should I do it?

The problem is that doc.getAllElements().first() returns
<html>
<head></head>
<body>
ZOLA <span class="tiny">(1)</span>
</body>
</html>
while you expect
ZOLA <span class="tiny">(1)</span>
The HTML in the question contains no wrapping element, so jsoup places the fragment directly inside <body>. Reading from the body should work for you:
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html);
Element body = doc.body();
System.out.println(body.html());
System.out.println(body.ownText());
Output:
ZOLA <span class="tiny">(1)</span>
ZOLA

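If the text is wrapped in an element such as <a> (assumed here, since the HTML in the question has no such wrapper), select that element and call ownText() on it:
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>";
Document doc = Jsoup.parse(html);
Element elementA = doc.selectFirst("a");
System.out.println(elementA.ownText()); // ZOLA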
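
ownText() joins all of an element's direct text nodes into one string. When text appears both before and after a child element, textNodes() exposes each piece separately. A small sketch of the difference, assuming jsoup is on the classpath; the class OwnTextDemo, the helper directTextPieces, and the trailing "Emile" in the markup are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

class OwnTextDemo {

    // Returns the trimmed, non-blank text of each direct child text node.
    static List<String> directTextPieces(Element element) {
        List<String> pieces = new ArrayList<>();
        for (TextNode node : element.textNodes()) {
            String text = node.text().trim();
            if (!text.isEmpty()) {
                pieces.add(text);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Hypothetical markup: text on both sides of the child <span>.
        Element link = Jsoup.parse("<a>ZOLA <span class=\"tiny\">(1)</span> Emile</a>")
                .selectFirst("a");
        System.out.println(link.text());            // includes the span: ZOLA (1) Emile
        System.out.println(link.ownText());         // direct text only: ZOLA Emile
        System.out.println(directTextPieces(link)); // [ZOLA, Emile]
    }
}
```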

Regex to filter span tag if it having attribute

I have the HTML below and want to remove each span tag that has no attributes, using Java.
This regex removes all span tags: <(/)?[ ]*span[^>]*>
Input:
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p><span> </span></p><p><span>Table</span></p>
output:
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p> </p><p>Table</p>
Any help?

This can't be done reliably with a regular expression, because it can't know which closing </span> tag belongs to the <span> you want to remove. Use an HTML parser such as jsoup.
Edit:
Example
String html = "<span style=\"font-weight: bold;text-decoration-line: underline;\">test</span><p><span> </span></p><p><span>Table</span></p>";
Document doc = Jsoup.parse(html);
for (Element span : doc.getElementsByTag("span")) {
    if (span.attributes().size() == 0) {
        span.unwrap();
    }
}
doc.outputSettings().prettyPrint(false);
String result = doc.body().html();
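
To illustrate why a regex cannot pair tags, here is a small sketch that uses only java.util.regex. The pattern and the nested sample input are made up for the demonstration and are deliberately naive; this is not a recommended approach.

```java
import java.util.regex.Pattern;

class NaiveSpanRegex {

    // Deliberately naive: strips an attribute-less <span> together with the NEXT </span>,
    // whichever element that closing tag really belongs to.
    private static final Pattern BARE_SPAN = Pattern.compile("<span>(.*?)</span>");

    static String stripBareSpans(String html) {
        return BARE_SPAN.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Flat markup from the question: the regex happens to work.
        System.out.println(stripBareSpans("<p><span>Table</span></p>"));
        // prints: <p>Table</p>

        // Hypothetical nested markup: the first </span> closes the inner span,
        // but the regex pairs it with the outer one.
        System.out.println(stripBareSpans("<span><span class=\"k\">keep</span> tail</span>"));
        // prints: <span class="k">keep tail</span>
    }
}
```

In the second case " tail" ends up inside the styled span, whereas unwrapping the outer span with a parser would give <span class="k">keep</span> tail.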

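Try this in Java. Note that it removes every closing </span> tag, whether or not the matching opening tag has attributes, so the opening tags still have to be handled separately:
String str = "<p><span>Table</span></p>"; // your string here
str = str.replaceAll("<\\/span[^>]*>", "");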

Html parse with Jsoup

<div>
<div class = "main">
<div class ="content">
<div class="content_left">
<div class="alisveris_context_box">
<ul class = "sinema_list">
<li>
<a href="blabla/12" title="asd">
<img src="http://asd.jpg">
<span class ="cartoon">
Textaa
</span>
How can I get the href value (blabla/12 in the example) and the span value (Textaa in the example)?

Let's say your HTML is the following:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://example.com/"
link.attr("href") will give you the link.
The same goes for your span: select it and call text(). Think for yourself :)
source: http://jsoup.org/cookbook/extracting-data/attributes-text-html

Elements elements = Jsoup.parse(html).select("div[class=main] div[class=content] div[class=content_left] div[class=alisveris_context_box] ul[class=sinema_list] li a");
String href = elements.first().attr("href");
String spanText = elements.first().select("span[class=cartoon]").first().text();

With jsoup this is easy.
You can get the span value like this:
String st="<div> <div class = \"main\"> <div class =\"content\"> "
+ "<div class=\"content_left\"> <div class=\"alisveris_context_box\">"
+ " <ul class = \"sinema_list\"> <li> <a href=\"blabla/12\" title=\"asd\">"
+ "<img src=\"http://asd.jpg\"> <span class =\"cartoon\"> Textaa </span>";
String spanValue = Jsoup.parse(st).select("span.cartoon").text();
and the href value with:
String href = Jsoup.parse(st).getElementsByTag("a").attr("href");
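
The answers above assume the elements exist; first() and selectFirst() return null when nothing matches. A sketch of a null-tolerant variant follows, assuming jsoup is on the classpath. The class LinkAndSpan and its two helper methods are illustrative names, and returning an empty string for a missing element is just one possible convention.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkAndSpan {

    // href of the first link in the list, or "" when there is no such link.
    static String firstHref(Document doc) {
        Element link = doc.selectFirst("ul.sinema_list li a[href]");
        return link == null ? "" : link.attr("href");
    }

    // Text of the span inside that link, or "" when it is missing.
    static String cartoonText(Document doc) {
        Element span = doc.selectFirst("ul.sinema_list li a span.cartoon");
        return span == null ? "" : span.text();
    }

    public static void main(String[] args) {
        String st = "<div> <div class = \"main\"> <div class =\"content\"> "
                + "<div class=\"content_left\"> <div class=\"alisveris_context_box\">"
                + " <ul class = \"sinema_list\"> <li> <a href=\"blabla/12\" title=\"asd\">"
                + "<img src=\"http://asd.jpg\"> <span class =\"cartoon\"> Textaa </span>";
        Document doc = Jsoup.parse(st); // parse once, query twice
        System.out.println(firstHref(doc));   // blabla/12
        System.out.println(cartoonText(doc)); // Textaa
    }
}
```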

JSoup Select Problems

I'm having trouble selecting links in my html. Here's the html I have:
<div class=first>
<a href=www.test1.com>test1</a>
<div class=nope>
<a href=www.test2.com>test2</a>
<a href=www.test3.com>test3</a>
<a href=www.test4.com>test4</a>
</div>
</div>
What I want to do is pull the URLs:
www.test2.com
www.test3.com
www.test4.com
I have tried a lot of different .select and .not combinations but I just can't figure it out. Can anyone point out what I'm doing wrong?
String url = "<div class=first><a href=www.test1.com>test1</a>One<div class=nope><a href=www.test2.com>test2</a>Two</div></div><div class=second><a href=www.test3.com>test3</a></div>";
Document doc = Jsoup.parse(url);
Elements divs = doc.select("div a[href]").not(".first.nope a[href]");
System.out.println(divs);

Document doc = Jsoup.parse(html); // html holds your HTML code
Elements links = doc.select("div.nope a");
for (Element link : links) {
    System.out.println(link.attr("href"));
}

I would do it a little differently:
Elements elements = doc.select("div.nope").select("a[href]");
for (Element element : elements) {
System.out.println(element.attr("href"));
}

Elements data = doc.getElementsByClass("nope");
for (Element d : data) {
    for (Element link : d.getElementsByTag("a")) {
        String yourData = link.attr("href");
    }
}
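
For reference, here is a self-contained sketch that runs the descendant selector against the HTML from the question, assuming jsoup is on the classpath. The class NopeLinks and the method hrefsInside are illustrative names, not part of jsoup.

```java
import java.util.List;
import org.jsoup.Jsoup;

class NopeLinks {

    // Collects the href of every link nested inside a <div> with the given class.
    static List<String> hrefsInside(String html, String cssClass) {
        return Jsoup.parse(html)
                .select("div." + cssClass + " a[href]")
                .eachAttr("href");
    }

    public static void main(String[] args) {
        String html = "<div class=first>"
                + "<a href=www.test1.com>test1</a>"
                + "<div class=nope>"
                + "<a href=www.test2.com>test2</a>"
                + "<a href=www.test3.com>test3</a>"
                + "<a href=www.test4.com>test4</a>"
                + "</div>"
                + "</div>";
        System.out.println(hrefsInside(html, "nope"));
        // [www.test2.com, www.test3.com, www.test4.com]
    }
}
```

The link to www.test1.com is skipped because it is a child of div.first but not a descendant of div.nope.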

Jsoup image tag extraction

I need to extract an image tag using jsoup from this HTML:
<div class="picture">
<img src="http://asdasd/aacb.jpgs" title="picture" alt="picture" />
</div>
I need to extract the src of this img tag.
I am using this code, but I am getting a null value:
Element masthead2 = doc.select("div.picture").first();
String linkText = masthead2.outerHtml();
Document doc1 = Jsoup.parse(linkText);
Element masthead3 = doc1.select("img[src]").first();
String linkText1 = masthead3.html();

Here's an example to get the image source attribute:
public static void main(String... args) {
    Document doc = Jsoup.parse("<div class=\"picture\"><img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /></div>");
    Element img = doc.select("div.picture img").first();
    String imgSrc = img.attr("src");
    System.out.println("Img source: " + imgSrc);
}
The div.picture img selector finds the image element under the div.
The main extraction methods on an element are:
attr(name), which gets the value of an element's attribute,
text(), which gets the text content of an element (e.g. in <p>Hello</p>, text() is "Hello"),
html(), which gets an element's inner HTML (for <div><img></div>, html() is <img>), and
outerHtml(), which gets an element's full HTML (for <div><img></div>, outerHtml() is <div><img></div>).
You don't need to reparse the HTML as in your current example. Either select the correct element in the first place with a more specific selector, or call element.select(String) on an element you already have to narrow the results.
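
The difference between these methods can be seen on the image from the question. The sketch below assumes jsoup is on the classpath; ExtractMethodsDemo and firstImage are illustrative names. It also shows why the original code produced no useful value: an <img> element has no children, so its inner HTML is empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ExtractMethodsDemo {

    // Selects the image directly, without reparsing any intermediate HTML.
    static Element firstImage(String html) {
        return Jsoup.parse(html).selectFirst("div.picture img");
    }

    public static void main(String[] args) {
        String html = "<div class=\"picture\">"
                + "<img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" />"
                + "</div>";
        Element img = firstImage(html);

        System.out.println("attr(\"src\"): " + img.attr("src")); // http://asdasd/aacb.jpgs
        // <img> has no child nodes, so html() returns an empty string.
        System.out.println("html() is empty: " + img.html().isEmpty()); // true
        // outerHtml() includes the tag itself and its attributes.
        System.out.println("outerHtml(): " + img.outerHtml());
    }
}
```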

<tr> <td class="blackNoLine" nowrap="nowrap" valign="top" width="25" align="left"><b>CAST: </b></td> <td class="blackNoLine" valign="top" width="416">Jay, Shazahn Padamsee </td> </tr>
You can use:
Document doc = Jsoup.parse(...);
Elements els = doc.select("td[class=blackNoLine]");
Element el= els.get(1);
String castName = el.text();

With the following code I can extract the image correctly:
Document doc = Jsoup.parse("<div class=\"picture\"> <img src=\"http://asdasd/aacb.jpgs\" title=\"picture\" alt=\"picture\" /> </div>");
Element elem = doc.select("div.picture img").first();
System.out.println("elem: " + elem.attr("src"));
I'm using jsoup release 1.2.2, the latest one.
Maybe you're trying to print the inner html of an empty tag like img.
From the documentation: "html() - Retrieves the element's inner HTML".
For the second portion of HTML you can use:
// Wrapped in <table> because current jsoup versions follow HTML5 parsing rules, which drop <tr>/<td> tags found outside a table.
Document doc2 = Jsoup.parse("<table><tr> <td class=\"blackNoLine\" nowrap=\"nowrap\" valign=\"top\" width=\"25\" align=\"left\"><b>CAST: </b></td> <td class=\"blackNoLine\" valign=\"top\" width=\"416\">Jay, Shazahn Padamsee </td> </tr></table>");
Elements trElems = doc2.select("tr");
for (Element element : trElems) {
    Element secondTd = element.select("td").get(1);
    System.out.println("name: " + secondTd.text());
}
which prints "Jay, Shazahn Padamsee".
