extracting Text non recursively with Jsoup

This is the code I'm trying to run:
String html = "ZOLA <span class=\"tiny\">(1)</span>";
Document doc = Jsoup.parse(html); // parse the HTML string
Element element = doc.getAllElements().first(); // take the first element
System.out.println(element.text()); // prints "ZOLA (1)"
System.out.println(element.ownText()); // prints nothing
My goal is to extract only "ZOLA", without the text of the child nodes, but ownText() prints nothing.
How should I do it?

The problem is that doc.getAllElements().first() returns the root of the parsed document, which has no text of its own:
<html>
<head></head>
<body>
ZOLA <span class="tiny">(1)</span>
</body>
</html>
while you expect
ZOLA <span class="tiny">(1)</span>
The following should work for you:
// assumes the text sits inside an <a> element, which the lookup below requires
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>";
Document doc = Jsoup.parse(html);
Elements links = doc.getElementsByTag("a");
System.out.println(links.get(0));
System.out.println(links.get(0).ownText());
Output:
<a>ZOLA <span class="tiny">(1)</span></a>
ZOLA

You can use this:
// assumes the text sits inside an <a> element; selectFirst returns null otherwise
String html = "<a>ZOLA <span class=\"tiny\">(1)</span></a>";
Document doc = Jsoup.parse(html);
Element elementA = doc.selectFirst("a");
System.out.println(elementA.ownText()); // ZOLA
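
For the HTML exactly as posted in the question, with no wrapping element, jsoup places the text directly under <body>. A minimal sketch under that assumption (jsoup on the classpath; the class and method names are hypothetical) asks the body, or any selected element, for its own text:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Returns only the text that sits directly inside <body>, skipping child elements.
    static String bodyOwnText(String html) {
        Document doc = Jsoup.parse(html); // jsoup wraps a fragment in <html><head><body>
        return doc.body().ownText();
    }

    // Returns the own text of the first element matching the CSS query, or "" if none matches.
    static String ownTextOf(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.ownText();
    }

    public static void main(String[] args) {
        String html = "ZOLA <span class=\"tiny\">(1)</span>";
        System.out.println(bodyOwnText(html));            // ZOLA
        System.out.println(ownTextOf(html, "span.tiny")); // (1)
    }
}
```

The null check matters because selectFirst returns null when nothing matches, which is what happens when the expected wrapper element is missing from the input.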

Related

Regex to filter span tag if it having attribute

I want to strip span tags that have no attributes, using Java.
The regex below removes all span tags: <(/)?[ ]*span[^>]*>
e.g.
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p><span> </span></p><p><span>Table</span></p>
output:
<span style="font-weight: bold;text-decoration-line: underline;">test</span><p> </p><p>Table</p>
Any help?

It's not possible in general. A regular expression can't know which closing </span> tag belongs to the <span> you want to remove. Use an HTML parser such as jsoup.
Edit:
Example
String html = "<span style=\"font-weight: bold;text-decoration-line: underline;\">test</span><p><span> </span></p><p><span>Table</span></p>";
Document doc = Jsoup.parse(html);
for (Element span : doc.getElementsByTag("span")) {
    if (span.attributes().size() == 0) {
        span.unwrap(); // drop the tag but keep its children in place
    }
}
doc.outputSettings().prettyPrint(false); // keep the original whitespace
String result = doc.body().html();

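Try this in Java. It only handles attribute-less spans that are not nested inside each other:
String str = "<p><span>Table</span></p>"; // your string here
str = str.replaceAll("(?s)<span>(.*?)</span>", "$1"); // <p>Table</p>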

How to read image "alt" attributes within links using jsoup?

I need to read alt attributes with the jsoup library.
For example:
<a href="www.test.com">
<img src="http://test.org/images/icon/socialNetwork/telegram-icon.png" border="0" alt="telegram"/>
</a>
How can I read it?

Here is a code snippet that reads the alt attribute of every img tag:
String html = "<a href=\"www.test.com\"> <img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\" border=\"0\" alt=\"telegram\"/> </a>";
Document document = Jsoup.parse(html);
Elements elements = document.getElementsByTag("img");
for (Element e : elements) {
    String alt = e.attr("alt");
    System.out.println("alt: " + alt);
}
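
If only images inside links are of interest, as in the question, a CSS selector can express that directly. The sketch below is one possible variant, not the only way to do it; it assumes jsoup is on the classpath, and the class and method names are hypothetical:

```java
import java.util.List;
import org.jsoup.Jsoup;

public class AltTextDemo {
    // Collects the alt text of every <img> that has an alt attribute and sits inside an <a>.
    static List<String> linkedImageAlts(String html) {
        return Jsoup.parse(html).select("a img[alt]").eachAttr("alt");
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.test.com\">"
                + "<img src=\"http://test.org/images/icon/socialNetwork/telegram-icon.png\""
                + " border=\"0\" alt=\"telegram\"/></a>";
        System.out.println(linkedImageAlts(html)); // [telegram]
    }
}
```

Images without an alt attribute, or outside any link, are simply not matched by the selector.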

Html parse with Jsoup

<div>
<div class = "main">
<div class ="content">
<div class="content_left">
<div class="alisveris_context_box">
<ul class = "sinema_list">
<li>
<a href="blabla/12" title="asd">
<img src="http://asd.jpg">
<span class ="cartoon">
Textaa
</span>
How can I get the href value (blabla/12 in the example) and the span value (Textaa in the example)?

Let's say your HTML is the following:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String linkHref = link.attr("href"); // "http://example.com/"
link.attr("href") returns the link target.
The span works the same way: select it and call text() on it.
source: http://jsoup.org/cookbook/extracting-data/attributes-text-html

Elements elements = Jsoup.parse(html).select("div[class=main] div[class=content] div[class=content_left] div[class=alisveris_context_box] ul[class=sinema_list] li a");
String href = elements.first().attr("href");
String spanText = elements.first().select("span[class=cartoon]").first().text();

With jsoup you can read both values easily.
You get the span value like this:
String st = "<div> <div class = \"main\"> <div class =\"content\"> "
        + "<div class=\"content_left\"> <div class=\"alisveris_context_box\">"
        + " <ul class = \"sinema_list\"> <li> <a href=\"blabla/12\" title=\"asd\">"
        + "<img src=\"http://asd.jpg\"> <span class =\"cartoon\"> Textaa </span>";
String spanValue = Jsoup.parse(st).select("span.cartoon").text();
and the href value like this:
String href = Jsoup.parse(st).getElementsByTag("a").attr("href");
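
The answers above can be combined into one lookup that parses the page once and tolerates missing elements. This is a rough sketch, assuming jsoup is on the classpath and that the fragment from the question is closed properly; the class and method names are hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class LinkAndSpanDemo {
    // Returns {href, span text} for the first link in the list, or null if there is no such link.
    static String[] hrefAndSpanText(String html) {
        Element link = Jsoup.parse(html).selectFirst("ul.sinema_list li a[href]");
        if (link == null) {
            return null;
        }
        Element span = link.selectFirst("span.cartoon");
        return new String[] { link.attr("href"), span == null ? "" : span.text() };
    }

    public static void main(String[] args) {
        // Closing tags are added here as an assumption; the question's fragment is cut off.
        String html = "<ul class=\"sinema_list\"><li><a href=\"blabla/12\" title=\"asd\">"
                + "<img src=\"http://asd.jpg\"><span class=\"cartoon\"> Textaa </span></a></li></ul>";
        String[] result = hrefAndSpanText(html);
        System.out.println(result[0]); // blabla/12
        System.out.println(result[1]); // Textaa
    }
}
```

Class selectors such as ul.sinema_list are shorter than [class=...] attribute selectors and keep matching if further classes are added to the element.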

Jsoup: select(div[class=rslt prod]) returns null when it shouldn't

I am trying to select all divs with class="rslt prod" from this page: http://www.amazon.fr/s/field-keywords=samsung
Document doc = Jsoup.connect("http://www.amazon.fr/s/field-keywords=samsung").get();
Elements divProd = doc.select("div[class=rslt prod]");
System.out.println("\nsize: "+divProd.size());
But it returns 0 and it shouldn't. Any idea why?
example of what should be selected:
<div id="result_4" class="rslt prod" name="B006O9QNHU">
[...]
</div>

You have to change the user agent, otherwise you get a different page from Amazon.
Document doc = Jsoup.connect("http://www.amazon.fr/s/field-keywords=samsung")
.userAgent("Mozilla/17.0") // you can use any other user agent here
.get();
for( Element element : doc.select("div[class=rslt prod]") )
{
System.out.println(element);
System.out.println("");
}
Now the output is a list like
<div id="result_1" class="rslt prod" name="B007XOM6SU">
...
</div>
<div id="result_2" class="rslt prod" name="B006SXSF4Q">
...
</div>
...
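
Once the expected page is returned, the selector itself can be made less brittle. The sketch below runs against a local HTML string instead of the live site and uses the class selector div.rslt.prod, which matches whatever the order of the classes. It assumes jsoup is on the classpath; the class and method names, and the extra "highlighted" class in the sample, are hypothetical:

```java
import java.util.List;
import org.jsoup.Jsoup;

public class ClassSelectorDemo {
    // Returns the name attribute of every <div> carrying both the "rslt" and "prod" classes.
    static List<String> productNames(String html) {
        return Jsoup.parse(html).select("div.rslt.prod").eachAttr("name");
    }

    public static void main(String[] args) {
        String html = "<div id=\"result_1\" class=\"rslt prod\" name=\"B007XOM6SU\">...</div>"
                + "<div id=\"result_2\" class=\"prod highlighted rslt\" name=\"B006SXSF4Q\">...</div>"
                + "<div class=\"rslt\" name=\"ignored\">...</div>";
        System.out.println(productNames(html)); // [B007XOM6SU, B006SXSF4Q]
    }
}
```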

JSoup Select Problems

I'm having trouble selecting links in my html. Here's the html I have:
<div class=first>
<a href=www.test1.com>test1</a>
<div class=nope>
<a href=www.test2.com>test2</a>
<a href=www.test3.com>test3</a>
<a href=www.test4.com>test4</a>
</div>
</div>
What I want to do is pull the URLs:
www.test2.com
www.test3.com
www.test4.com
I have tried a lot of different .select and .not combinations but I just can't figure it out. Can anyone point out what I'm doing wrong?
String url = "<div class=first><a href=www.test1.com>test1</a>One<div class=nope><a href=www.test2.com>test2</a>Two</div></div><div class=second><a href=www.test3.com>test3</a></div>";
Document doc = Jsoup.parse(url);
Elements divs = doc.select("div a[href]").not(".first.nope a[href]");
System.out.println(divs);

Document doc = Jsoup.parse("your html code");
Elements links = doc.select("div.nope a"); // all links inside <div class="nope">
for (Element link : links) {
    System.out.println(link.attr("href"));
}
I would do it a little differently:
Elements elements = doc.select("div.nope").select("a[href]");
for (Element element : elements) {
System.out.println(element.attr("href"));
}
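
The attempt in the question does not filter anything because ".first.nope" only matches a single element that carries both classes, so .not() has nothing to remove. Selecting the links through their container avoids the exclusion step altogether. A small sketch of that idea, assuming jsoup is on the classpath (the class and method names are hypothetical):

```java
import java.util.List;
import org.jsoup.Jsoup;

public class NestedLinksDemo {
    // Returns the href of every link nested anywhere inside <div class="nope">.
    static List<String> hrefsInsideNope(String html) {
        return Jsoup.parse(html).select("div.nope a[href]").eachAttr("href");
    }

    public static void main(String[] args) {
        String html = "<div class=first>"
                + "<a href=www.test1.com>test1</a>"
                + "<div class=nope>"
                + "<a href=www.test2.com>test2</a>"
                + "<a href=www.test3.com>test3</a>"
                + "<a href=www.test4.com>test4</a>"
                + "</div></div>";
        System.out.println(hrefsInsideNope(html)); // [www.test2.com, www.test3.com, www.test4.com]
    }
}
```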

Elements data = doc.getElementsByClass("nope");
for (Element d : data) {
    // read the href of each link inside the matching element
    for (Element a : d.getElementsByTag("a")) {
        String yourData = a.attr("href");
    }
}
