I am trying to parse some HTML for an Android app, but I can't get the value of the data-id attribute.
Here's the HTML:
<div class="popup event-popup Predavanja" style="display: none;" data-id="246274" data-position="bottom" >
How can I parse the 246274 value?
If you have the Element object of the div tag, then this code will work:
String attr = element.attr("data-id"); // get the value of the 'data-id' attribute
int dataID = Integer.parseInt(attr); // convert it to an int
Optionally, if you want to check first if the attribute even exists, use this:
if (element.hasAttr("data-id")) // etc.
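One caveat with the snippet above: jsoup's attr returns an empty string when the attribute is absent, and Integer.parseInt throws a NumberFormatException on an empty or non-numeric string. Below is a minimal sketch of a defensive conversion that uses only the standard library. The class and method names (DataIdParser, parseDataId) are made up for illustration, and the argument stands for whatever element.attr("data-id") returned.

```java
import java.util.OptionalInt;

public class DataIdParser {

    // Converts the raw attribute text to an int. Returns an empty OptionalInt
    // when the text is null, blank, or not a valid integer.
    public static OptionalInt parseDataId(String attr) {
        if (attr == null || attr.trim().isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(attr.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDataId("246274")); // value from the question's HTML
        System.out.println(parseDataId(""));       // what a missing attribute would yield
    }
}
```

With this helper the hasAttr check becomes optional, since a missing attribute and a malformed value both end up as an empty result.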
You can do it like this:
Document doc = Jsoup.connect("Url").get(); // fetch and parse the page at this URL
Element divElement = doc.select("div.popup.event-popup.Predavanja").first(); // div carrying all three classes
String dataId = divElement.attr("data-id");
See the selector syntax reference: https://jsoup.org/cookbook/extracting-data/selector-syntax
I need to get currency data from a website. Here is the website's HTML table code:
<tr>
<td class="currency-up"></td>
<td class="currency">
ABD Doları
</td>
<td class>8,2805</td>
<td class>8,2856</td>
</tr>
I wrote this code, but it does not work:
String url = "https://uzmanpara.milliyet.com.tr/doviz-kurlari/";
Document doc = null;
try {
doc = Jsoup.connect(url).timeout(6000).get();
} catch (IOException ex) {
Logger.getLogger(den3.class.getName()).log(Level.SEVERE, null, ex);
}
Element link = doc.select("href").first();
String linkHref = link.attr("href"); // "http://example.com/"
System.out.println(linkHref);
But I got this problem:
Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException:
Cannot invoke "org.jsoup.nodes.Element.attr(String)" because "link" is
null
How can I fix this problem and get the currency rates?
Try this:
Element link = doc.select("a[href]").first();
If you pass just href, jsoup looks for elements whose tag name is href, and no such tag exists. You have to select a elements that have an href attribute instead.
Let's start with a simple example.
For example, to get the text of the second span under the link whose href value is /dolar-kuru/, you can try:
// Select by id.
Element element2 = doc.select("#usd_header_son_data").first();
String usd2 = element2.text();
System.out.println(usd2);
// Select the span nested inside the span of the link with this href (variant 1).
Element element1 = doc.select("a[href='/dolar-kuru/'] > span > span").first();
String usd1 = element1.text();
System.out.println(usd1);
// Select the second child inside the span of the link with this href (variant 2).
Element element3 = doc.select("a[href='/dolar-kuru/'] > span :nth-child(2)").first();
String usd3 = element3.text();
System.out.println(usd3);
We can take the example one step further.
Let's take both the buy and sell prices from a table of exchange rates.
Elements elements = doc.select(".borsaMain > div:nth-child(2) > div:nth-child(1) > table td.currency");
for (Element element : elements) {
    // Each td.currency sits in a row; read the name, buy and sell cells of that row.
    Elements currency = element.parent().select("td:nth-child(2)");
    Elements buy = element.parent().select("td:nth-child(3)");
    Elements sell = element.parent().select("td:nth-child(4)");
    System.out.println(String.format("%s [buy=%s, sell=%s]",
            currency.text(), buy.text(), sell.text()));
}
This gives output that looks like this:
ABD Doları [buy=8,2855, sell=8,2888]
Euro [buy=9,8389, sell=9,8645]
İngiliz Sterlini [buy=11,4203, sell=11,4775]
Kanada Doları [buy=6,5696, sell=6,6091]
İsviçre Frangı [buy=9,0128, sell=9,0671]
Suudi Riyali [buy=2,2025, sell=2,2135]
...
Many other selectors are available; see https://jsoup.org/cookbook/extracting-data/selector-syntax
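The rates in the output above use a comma as the decimal separator, so Double.parseDouble would reject them. If the values are needed as numbers rather than text, one option is a DecimalFormat with explicit separators. The sketch below is only an illustration; RateParser and parseRate are hypothetical names, and the input strings are taken from the sample output.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

public class RateParser {

    // Parses text such as "8,2855" (comma as decimal separator, dot as
    // grouping separator) into a BigDecimal.
    public static BigDecimal parseRate(String text) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(',');
        symbols.setGroupingSeparator('.');
        DecimalFormat format = new DecimalFormat("#,##0.####", symbols);
        format.setParseBigDecimal(true); // make parse() return a BigDecimal
        try {
            return (BigDecimal) format.parse(text.trim());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a rate: " + text, e);
        }
    }

    public static void main(String[] args) {
        BigDecimal buy = parseRate("8,2855");
        BigDecimal sell = parseRate("8,2888");
        // Difference between the sell and buy prices.
        System.out.println(sell.subtract(buy));
    }
}
```

BigDecimal is used here instead of double to avoid rounding surprises with monetary values; that is a design preference, not something the original answer prescribes.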
For the provided HTML code, you can do the following:
Element link = doc.select("a[href]").first();
String linkHref = link.attr("href");
System.out.println(linkHref);
For the url provided in the code sample, if you want to select only the first, you can do:
Element link = doc.select("td.currency > a").first();
String linkHref = link.attr("href");
System.out.println(linkHref);
To explain the previous code: "td.currency" matches "td" elements with the class "currency", and " > a" keeps only their direct children that are "a" elements.
If you want all currencies, you can do:
Elements links = doc.select("td.currency > a");
links.forEach(link -> System.out.println(link.attr("href")));
Note that there are some duplicates in the last code sample.
I am very new to JSoup and am trying to retrieve a changeable value that is stored within an <a> tag, specifically from the following website and HTML.
Snapshot of HTML
The results after "constituency/" are changeable and depend on the user's input. I am able to retrieve the h2 tags themselves but not the information within them. At the moment the best return I can get is just tags, using the method below.
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
String constit = "";
String h2 = "h2";
String url = "https://www.whoismytd.com/search?utf8=✓&form-input="+aLine1+"%2C+"+aLine3+"+Ireland";
//Switch to try catch if time
Document doc = Jsoup.connect(url)
.timeout(6000).get();
//Scrape elements from relevant section
Elements body = doc.select("well.col-md-4.h2");
Element e = new Element("well.col-md-4.h2");
constit = e.toString();
return constit;
I am extremely new to JSoup and scraping in general. Would appreciate any input from someone who knows what they're doing or any alternate ways to try and get the desired result
Replace the code under your "Scrape elements from relevant section" comment with the following steps:
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import java.io.IOException;
@DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {
    private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";
    @Test
    void findSomeText() throws IOException {
        String expected = "Dublin Bay South";
        Document document = Jsoup.connect(URL).get();
        // Read the text of the link whose href points at the constituency page.
        String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
        Assertions.assertEquals(expected, actual);
    }
}
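The method in the question builds the search URL by concatenating raw user input, which breaks as soon as an address contains spaces or other reserved characters. A small sketch of building the same kind of URL with URLEncoder follows. SearchUrlBuilder and buildSearchUrl are hypothetical names, the parameter layout is copied from the question's URL, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlBuilder {

    private static final String BASE = "https://www.whoismytd.com/search";

    // Builds the search URL from two address lines, percent-encoding the
    // query values so spaces and commas are safe to send.
    public static String buildSearchUrl(String addressLine1, String addressLine3) {
        String query = addressLine1 + ", " + addressLine3 + " Ireland";
        return BASE
                + "?utf8=" + URLEncoder.encode("\u2713", StandardCharsets.UTF_8) // the check mark
                + "&form-input=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("Kildare Street", "Dublin"));
    }
}
```

URLEncoder encodes spaces as "+", which matches the form of the URL in the question; the resulting string can be passed to Jsoup.connect as before.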
I'm parsing html of a website with JSoup. I want to parse this part:
<td class="lastpost">
This is a text 1<br>
<a href="post/13594">Website Page - 1</a>
</td>
I want to end up with this:
String text = "This is a text 1";
String textNo = "Website Page - 1";
String link = "post/13594";
How can I get the parts like this?
Your code only gets the combined text of the td elements you select. If you want to store the pieces in separate variables, grab each part separately, as in the following code. The comments explain how and why each piece is retrieved.
// Get the first td element that has class="lastpost"
Element lastPost = document.select("td.lastpost").first();
// Get the first a element that is a child of the td
Element linkElement = lastPost.getElementsByTag("a").first();
// This text is the first child node of td; get that node, call toString, and trim surrounding whitespace
String text = lastPost.childNode(0).toString().trim();
// This is the text within the a (link) element
String textNo = linkElement.text();
// This text is the href attribute value of the a (link) element
String link = linkElement.attr("href");
I want to access this webpage: https://www.google.com/trends/explore#q=ice%20cream and extract the data shown in the center line graph. The HTML is as follows (I only paste the part that I use):
<div class="center-col">
<div class="comparison-summary-title-line">...</div>
...
<div id="reportContent" class="report-content">
<!-- This tag handles the report titles component -->
...
<div id="report">
<div id="reportMain">
<div class="timeSection">
<div class = "primaryBand timeBand">...</div>
...
<div aria-lable = "one-chart" style = "position: absolute; ...">
<svg ....>
...
<script type="text/javascript">
var chartData = {...}
The data I need is stored in the script element (last line). My idea is to get the element with class "report-content" first, and then select the script inside it. My code is as follows:
String html = "https://www.google.com/trends/explore#q=ice%20cream";
Document doc = Jsoup.connect(html).get();
Elements center = doc.getElementsByClass("center-col");
Elements report = doc.getElementsByClass("report-content");
System.out.println(center);
System.out.println(report);
When I print the "center" elements, I get the content of all the nested elements except "report-content", and when I print "report-content", the result is only:
<div id="reportContent" Class="report-content"></div>
I also tried this:
Element report = doc.select("div.report-content").first();
but it still does not work. How can I get the data in the script here? I appreciate your help!
Try this URL instead:
https://www.google.com/trends/trendsReport?hl=en&q=${keywords}&tz=${timezone}&content=1
where
${keywords} is an encoded space separated keywords list
${timezone} is an encoded timezone in the Etc/GMT* form
SAMPLE CODE
String myKeywords = "ice cream";
String myTimezone = "Etc/GMT+2";
String url = "https://www.google.com/trends/trendsReport?hl=en&q=" + URLEncoder.encode(myKeywords, "UTF-8") + "&tz=" + URLEncoder.encode(myTimezone, "UTF-8") + "&content=1";
Document doc = Jsoup.connect(url).timeout(10000).get();
Element scriptElement = doc.select("div#TIMESERIES_GRAPH_0-time-chart + script").first();
if (scriptElement==null) {
throw new RuntimeException("Unable to locate trends data.");
}
String jsCode = scriptElement.html();
// parse jsCode to extract chartData...
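The last step, pulling the chartData object out of the script text, is left open above. One possible approach, sketched here with the standard library only, is to locate the "var chartData" marker and then scan for the matching closing brace. ChartDataExtractor and extractChartData are hypothetical names, the sample script in main is invented for illustration, and the scan assumes no braces appear inside string literals.

```java
public class ChartDataExtractor {

    // Returns the object literal assigned to `var chartData`, or null when
    // the marker or a balanced closing brace cannot be found.
    // Assumption: braces inside JavaScript string literals are not handled.
    public static String extractChartData(String jsCode) {
        int marker = jsCode.indexOf("var chartData");
        if (marker < 0) {
            return null;
        }
        int start = jsCode.indexOf('{', marker);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        for (int i = start; i < jsCode.length(); i++) {
            char c = jsCode.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return jsCode.substring(start, i + 1);
                }
            }
        }
        return null; // unbalanced braces
    }

    public static void main(String[] args) {
        // Invented sample; the real script content will differ.
        String js = "var chartData = {\"rows\": [{\"v\": 1}, {\"v\": 2}]}; var other = {};";
        System.out.println(extractChartData(js));
    }
}
```

If the extracted text turns out to be valid JSON, it could then be handed to a JSON parser; whether it is depends on what the page actually emits.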
References:
How to extract the text of a <script> element with Jsoup?
Try selecting the same element by its id instead; that returns the complete tag.
I would like to display the default thumbnail image of this YouTube URL in my Android app:
<iframe width="560" height="315" src="https://www.youtube.com/embed/FXx_gbdIUKg" frameborder="0" allowfullscreen=""></iframe>
This is my method for doing so:
static String parseThumbnail(String youTubeURL){
org.jsoup.nodes.Document document = Jsoup.parse(youTubeURL);
Elements youtubeElements = document.select("FXx_gbdIUKg");
org.jsoup.nodes.Document iframeDoc = Jsoup.parse(youtubeElements.get(0).data());
Elements iframeElements = iframeDoc.select("iframe");
return iframeElements.attr("http://img.youtube.com/vi/"+youtubeElements+"/default.jpg");
The iframe is inside the "content:encoded" node, so I call the method here:
String itemYouTubeImage = null;
if (XML_TAG_CONTENT_ENCODED.equalsIgnoreCase(tag)) {
String contentEncoded = tagNode.getTextContent();
itemYouTubeImage = parseThumbnail(contentEncoded);
itemImageURL = parseImageFromHTML(contentEncoded);
itemContentEncodedText = parseTextFromHTML(contentEncoded);
How do I properly do this?
One problem I have is that the compiler tells me that the value parseThumbnail(contentEncoded) assigned to itemYouTubeImage is never used
If you just want the default thumbnail, it is provided in the <head> of the YouTube HTML document. It is not encoded.
<link itemprop="thumbnailUrl"
href="https://i.ytimg.com/vi/2qhzsn3pZgk/maxresdefault.jpg">
To select on the attribute value and get the absolute URL:
String youtubeUrl = "https://www.youtube.com/watch?v=9wpqE8OSWrU";
Document doc = Jsoup.connect(youtubeUrl).get();
String thumbnailUrl = doc
.select("link[itemprop=thumbnailUrl]")
.first()
.absUrl("href");
System.out.println(thumbnailUrl);
Output
https://i.ytimg.com/vi/9wpqE8OSWrU/maxresdefault.jpg
Read more in the Jsoup cookbook.
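If only the iframe markup from the question is available and fetching the YouTube page is undesirable, the video id can also be read from the embed URL and inserted into the img.youtube.com/vi/<id>/default.jpg pattern that the question itself uses. This is a rough sketch under that assumption; YouTubeThumbnail and thumbnailFromEmbed are hypothetical names, and the regular expression only covers the /embed/ form of the URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class YouTubeThumbnail {

    // Captures the video id that follows "youtube.com/embed/".
    private static final Pattern EMBED_ID =
            Pattern.compile("youtube\\.com/embed/([A-Za-z0-9_-]+)");

    // Returns the default thumbnail URL for the first embedded video found
    // in the given HTML, or null when no embed URL is present.
    public static String thumbnailFromEmbed(String html) {
        Matcher matcher = EMBED_ID.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return "https://img.youtube.com/vi/" + matcher.group(1) + "/default.jpg";
    }

    public static void main(String[] args) {
        String iframe = "<iframe width=\"560\" height=\"315\" "
                + "src=\"https://www.youtube.com/embed/FXx_gbdIUKg\" "
                + "frameborder=\"0\" allowfullscreen=\"\"></iframe>";
        System.out.println(thumbnailFromEmbed(iframe));
    }
}
```

The returned string can be assigned to itemYouTubeImage in the question's parsing loop in place of the parseThumbnail call.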