JSoup - How to parse nested texts?

I'm parsing the HTML of a website with JSoup. I want to parse this part:
<td class="lastpost">
This is a text 1<br>
<a href="post/13594">Website Page - 1</a>
</td>
I want to end up with this:
String text = "This is a text 1";
String textNo = "Website Page - 1";
String link = "post/13594";
How can I get the parts like this?

Your code only gets the combined text of the td elements you select. To store the pieces in separate variables, grab each one separately, as in the following code. The comments explain how and why each piece is retrieved.
// Get the first td element that has class="lastpost"
Element lastPost = document.select("td.lastpost").first();
// Get the first a element that is a child of the td
Element linkElement = lastPost.getElementsByTag("a").first();
// This text is the first child node of td; get that node, call toString, and trim surrounding whitespace
String text = lastPost.childNode(0).toString().trim();
// This is the text within the a (link) element
String textNo = linkElement.text();
// This text is the href attribute value of the a (link) element
String link = linkElement.attr("href");

Related

Xpath from a span text in break

I am trying to validate the $40 with an assertion, but I am unable to work out the XPath. Any suggestions?
<div class="MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12"><div class="vetting-details"><h3 class="paragraph text-center"><span>Vetting Price: </span>$40 <br> <span>Estimated Time for Vetting:</span> 30 seconds</h3></div></div>

That $40 is a bare text node. You can retrieve it using:
WebElement e = driver.findElement(By.xpath("//h3[contains(@class,'paragraph text-center')]"));
String el = (String) ((JavascriptExecutor) driver).executeScript("return arguments[0].textContent;", e);
String s = el.split(" ")[2].trim();
System.out.println(s);
Explanation:
$40 is not wrapped in an element of its own; it is a text node directly inside the h3. Selenium locators return elements, not text nodes, so we locate the h3, use JavaScript to read its full text content, and then split that string on spaces to pick out the desired token.

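We can also use the code below (without JavascriptExecutor):
String value = driver.findElement(By.tagName("h3")).getText();
// The text after the first colon still contains the next label, so keep only the first token
String s1 = value.split(":")[1].trim().split("\\s+")[0];
System.out.println(s1);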
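Both answers come down to pulling one token out of the h3 text. As a minimal sketch in plain Java (no Selenium), assuming the heading text has already been read into a string shaped like the markup in the question, a regular expression anchored on the label may be less fragile than splitting on spaces or colons. VettingPriceExtractor and extractPrice are made-up names for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VettingPriceExtractor {
    // Matches the label, optional whitespace, then a dollar amount such as $40 or $40.50.
    private static final Pattern PRICE =
            Pattern.compile("Vetting Price:\\s*(\\$\\d+(?:\\.\\d+)?)");

    // Returns the price token if the label is present, otherwise an empty Optional.
    static Optional<String> extractPrice(String headingText) {
        Matcher matcher = PRICE.matcher(headingText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed shape of the h3 text: the <br> shows up as a line break.
        String headingText = "Vetting Price: $40\nEstimated Time for Vetting: 30 seconds";
        System.out.println(extractPrice(headingText).orElse("price not found")); // prints $40
    }
}
```

Returning an Optional means a missing label shows up as an explicit "not found" case in the test instead of an ArrayIndexOutOfBoundsException.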

Scrape currency exchange data from https://uzmanpara.milliyet.com.tr/doviz-kurlari/

I need to get the currency data from a website. Here is the HTML table code from the site:
<tr>
<td class="currency-up"></td>
<td class="currency">
ABD Doları
</td>
<td class>8,2805</td>
<td class>8,2856</td>
</tr>
I wrote this code, but I could not get it to work:
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
Element link = doc.select("href").first();
String linkHref = link.attr("href"); // "http://example.com/"
System.out.println(linkHref);
But I got this problem:
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException:
Cannot invoke "org.jsoup.nodes.Element.attr(String)" because "link" is
null
How can I fix this problem, and how can I get the currency rate?

You can try it like this:
Element link = doc.select("a[href]").first();
A bare href selector looks for an element whose tag name is href, and no such tag exists. You have to select a elements that have an href attribute.
Let's start with a simple example.
For example, to get the value of the second span below the element whose href value is /dolar-kuru/, you can try:
// Selecting by id.
Element element2 = doc.select("#usd_header_son_data").first();
String usd2 = element2.text();
System.out.println(usd2);
// Selecting the nested span below the a element with the given href. (1)
Element element1 = doc.select("a[href='/dolar-kuru/'] > span > span").first();
String usd1 = element1.text();
System.out.println(usd1);
// Selecting the second child below the a element with the given href. (2)
Element element3 = doc.select("a[href='/dolar-kuru/'] > span :nth-child(2)").first();
String usd3 = element3.text();
System.out.println(usd3);
We can take the example one step further.
Let's take both the buy and sell prices from a table of exchange rates.
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table td.currency");
for (Element element : elements) {
Elements currency = element.parent().select("td:nth-child(2)");
Elements buy = element.parent().select("td:nth-child(3)");
Elements sell = element.parent().select("td:nth-child(4)");
System.out.println(String.format("%s [buy=%s, sell=%s]",
currency.text(), buy.text(), sell.text()));
}
This gives output that looks like this:
ABD Doları [buy=8,2855, sell=8,2888]
Euro [buy=9,8389, sell=9,8645]
İngiliz Sterlini [buy=11,4203, sell=11,4775]
Kanada Doları [buy=6,5696, sell=6,6091]
İsviçre Frangı [buy=9,0128, sell=9,0671]
Suudi Riyali [buy=2,2025, sell=2,2135]
...
Many other selectors are available; see https://jsoup.org/cookbook/extracting-data/selector-syntax
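The scraped values such as 8,2855 are strings that use a comma as the decimal separator. If you need them as numbers, one possible approach (a sketch, not something from the answer above) is a DecimalFormat with explicit symbols. RateParser and parseRate are hypothetical names, and treating '.' as the grouping separator is an assumption about how the site formats larger values.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

final class RateParser {
    // Parses text such as "8,2855" into a BigDecimal.
    static BigDecimal parseRate(String text) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(',');   // comma marks the fraction
        symbols.setGroupingSeparator('.');  // assumption: dot groups thousands
        DecimalFormat format = new DecimalFormat("#,##0.####", symbols);
        format.setParseBigDecimal(true);    // avoid double rounding for money-like values
        try {
            return (BigDecimal) format.parse(text.trim());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a rate: " + text, e);
        }
    }

    public static void main(String[] args) {
        BigDecimal buy = parseRate("8,2855");
        BigDecimal sell = parseRate("8,2888");
        System.out.println(sell.subtract(buy)); // spread between the sell and buy prices
    }
}
```

Setting the symbols explicitly keeps the result independent of the default locale of the machine running the scraper.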

For the provided HTML code, you can do the following:
Element link = doc.select("a[href]").first();
String linkHref = link.attr("href");
System.out.println(linkHref);
For the url provided in the code sample, if you want to select only the first, you can do:
Element link = doc.select("td.currency > a").first();
String linkHref = link.attr("href");
System.out.println(linkHref);
To explain the previous code: "td.currency" matches td tags with the class "currency", and " > a" then selects their direct children that are a tags.
And if you want all currencies, you can do:
Elements links = doc.select("td.currency > a");
links.forEach(link -> System.out.println(link.attr("href")));
Note that the last code sample prints some duplicate links.
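If the duplicates are a problem, one way to drop them while keeping the original order is to pass the collected href values through a LinkedHashSet. This is only a sketch in plain Java: LinkDeduplicator and distinctInOrder are illustrative names, and the sample list stands in for the values that link.attr("href") would return ("/euro-kuru/" is a made-up value).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

final class LinkDeduplicator {
    // Removes repeated values while preserving first-seen order.
    static List<String> distinctInOrder(List<String> hrefs) {
        return new ArrayList<>(new LinkedHashSet<>(hrefs));
    }

    public static void main(String[] args) {
        // Stand-in for the href values collected from the page.
        List<String> hrefs = Arrays.asList("/dolar-kuru/", "/euro-kuru/", "/dolar-kuru/");
        System.out.println(distinctInOrder(hrefs)); // prints [/dolar-kuru/, /euro-kuru/]
    }
}
```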

Unable to read <p> text under <h3> using selenium

I have HTML that looks something like this:
<h3> Some Heading </h3>
<p> Some String </p>
<p> more string </p>
<h3> Other heading</h3>
<p> some text </p>
I am trying to access Some String, more string and some text. In Java, I am trying to access them like this:
List<WebElement> h3Tags = driver.findElements(By.tagName("h3"));
List<WebElement> para = null;
WebElement bagInfo = h3Tags.get(0); //reads first h3
if(bagInfo.getText().contains("carry-on") || bagInfo.getText().contains("Carry-on")){
para = AutoUtils.findElementsByTagName(bagInfo, "p");
System.out.println(para.get(0).getText()); //Null pointer here
}
bagInfo = h3Tags.get(1);
if(bagInfo.getText().contains("checked") || bagInfo.getText().contains("Checked")){
para = AutoUtils.findElementsByTagName(bagInfo, "p");
System.out.println(para.get(0).getText()); //Null pointer here too
}
I also tried an XPath like "h3['/p']", but still no luck. What is the best way to access those <p> strings?

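Try the XPath //h3/following-sibling::p to match all 3 paragraphs. The p elements are siblings of the h3 elements, not children, so searching inside an h3 finds nothing.
Also note that your XPath h3['/p'] doesn't do what you expect: it selects h3 elements that are children of the context node, and the predicate ['/p'] is always true, because a non-empty string literal ('/p' in your case) always evaluates to true.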
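To see the difference between the child and following-sibling axes without a browser, here is a small sketch that uses the XPath engine bundled with the JDK rather than Selenium. The markup from the question is wrapped in a hypothetical div root so that it parses as XML; FollowingSiblingDemo and select are illustrative names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FollowingSiblingDemo {
    // The question's markup, wrapped in a <div> root (an assumption) to make it well-formed XML.
    static final String MARKUP =
            "<div><h3> Some Heading </h3><p> Some String </p><p> more string </p>"
          + "<h3> Other heading</h3><p> some text </p></div>";

    // Evaluates the expression against MARKUP and returns the trimmed text of each matched node.
    static List<String> select(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(MARKUP)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("//h3/following-sibling::p")); // [Some String, more string, some text]
        System.out.println(select("//h3/p"));                    // [] because no p is a child of an h3
    }
}
```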

To access Some String, more string and some text, you can use the following locator strategies:
To access the node with the text Some String:
By.xpath("//h3[normalize-space()='Some Heading']//following::p[1]")
To access the node with the text more string:
By.xpath("//h3[normalize-space()='Some Heading']//following::p[2]")
To access the node with the text some text:
By.xpath("//h3[normalize-space()='Other heading']//following::p[1]")
Once you have located those elements, you can use the getAttribute("innerHTML") method to extract the text within the nodes.

Get class name Jsoup

I am trying to parse some HTML for an Android app, but I can't get the value of the data-id attribute.
Here's the HTML:
<div class="popup event-popup Predavanja" style="display: none;" data-id="246274" data-position="bottom" >
How can I parse the 246274 value?

If you have the Element object of the div tag, then this code will work:
String attr = element.attr("data-id"); // get the value of the 'data-id' attribute
int dataID = Integer.parseInt(attr); // convert it to an int
Optionally, if you want to check first if the attribute even exists, use this:
if (element.hasAttr("data-id")) // etc.
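One detail worth guarding against: jsoup's attr returns an empty string when the attribute is missing, and Integer.parseInt throws NumberFormatException on an empty or non-numeric string. A possible defensive wrapper, sketched in plain Java with the hypothetical names DataIdParser and parseDataId, could look like this:

```java
import java.util.OptionalInt;

final class DataIdParser {
    // Converts the raw attribute value to an int, or returns empty if it is missing or not numeric.
    static OptionalInt parseDataId(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // "246274" is the data-id value from the question; "" models a missing attribute.
        System.out.println(parseDataId("246274").getAsInt()); // prints 246274
        System.out.println(parseDataId("").isPresent());      // prints false
    }
}
```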

I think you can do it like this:
Document doc = Jsoup.connect("Url").get(); // "Url" stands for the page address
Element divElement = doc.select("div.popup.event-popup.Predavanja").first(); // div carrying all three classes
String dataId = divElement.attr("data-id");
See the selector syntax reference: https://jsoup.org/cookbook/extracting-data/selector-syntax

JSOUP Element.html("<th>test</th>") ignore th tags

I am working on an HTML templating engine based on jsoup.
jsoup ignores th and td tags if the element is not inside a table.
To deal with this, I changed the parser to:
final Document docToWrite = Jsoup.parse(docToRead.outerHtml(), "", Parser.xmlParser());
But I have not found a way to fill an Element with HTML that contains a td or a th:
element.html("<th>test</th>");
This returns only test, because jsoup cleans the HTML by removing tags it considers misplaced.
How can I solve this?
Thank you

If your element is a th, then calling:
element.html("<th>test</th>") // th.innerHTML = "<th>test</th>"
would produce invalid HTML:
<th><th>test</th></th>
which jsoup correctly cleans up to:
<th>test</th> // th.innerHTML == "test"
To fill an element so that its innerHTML == "<th>test</th>", the element has to be a <tr> tag.
// Given
String s = "<th>test</th>";
assert element.tagName().equals("tr"); // tag() returns a Tag object, so compare the tag name instead
// When
element.html(s);
// Then
assert element.html().equals(s);
