How to access the subclass using jsoup - java

I want to access this webpage: https://www.google.com/trends/explore#q=ice%20cream and extract the data in the center line graph. The HTML (only the relevant part is pasted here) is:
<div class="center-col">
<div class="comparison-summary-title-line">...</div>
...
<div id="reportContent" class="report-content">
<!-- This tag handles the report titles component -->
...
<div id="report">
<div id="reportMain">
<div class="timeSection">
<div class = "primaryBand timeBand">...</div>
...
<div aria-lable = "one-chart" style = "position: absolute; ...">
<svg ....>
...
<script type="text/javascript">
var chartData = {...}
The data I need is stored in the script element (last line). My idea is to get the element with class "report-content" first, and then select the script. My code is as follows:
String html = "https://www.google.com/trends/explore#q=ice%20cream";
Document doc = Jsoup.connect(html).get();
Elements center = doc.getElementsByClass("center-col");
Elements report = doc.getElementsByClass("report-content");
System.out.println(center);
System.out.println(report);
When I print "center", I get the content of all nested elements except "report-content", and when I print "report", the result is only:
<div id="reportContent" Class="report-content"></div>
I also tried this:
Element report = doc.select("div.report-content").first();
but it still does not work. How can I get the data in the script? I appreciate your help!

Try this url instead:
https://www.google.com/trends/trendsReport?hl=en&q=${keywords}&tz=${timezone}&content=1
where
${keywords} is a URL-encoded, space-separated list of keywords
${timezone} is a URL-encoded timezone in the Etc/GMT* form
SAMPLE CODE
String myKeywords = "ice cream";
String myTimezone = "Etc/GMT+2";
String url = "https://www.google.com/trends/trendsReport?hl=en&q=" + URLEncoder.encode(myKeywords, "UTF-8") + "&tz=" + URLEncoder.encode(myTimezone, "UTF-8") + "&content=1";
Document doc = Jsoup.connect(url).timeout(10000).get();
Element scriptElement = doc.select("div#TIMESERIES_GRAPH_0-time-chart + script").first();
if (scriptElement==null) {
throw new RuntimeException("Unable to locate trends data.");
}
String jsCode = scriptElement.html();
// parse jsCode to extract chartData...
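One possible way to carry out the "parse jsCode" step is to cut the object literal that follows `var chartData =` out of the script text by counting braces. This is only a minimal sketch, assuming the script has the shape shown in the question (`var chartData = {...}`). The class and method names are hypothetical, and the extracted text would still have to be handed to a JSON parser, which may reject it if the literal is not strict JSON.

```java
class ChartDataExtractor {

    /**
     * Returns the text of the object literal assigned to the given variable,
     * or null if it cannot be found. Braces inside string literals are ignored.
     */
    static String extractObjectLiteral(String jsCode, String varName) {
        int decl = jsCode.indexOf("var " + varName);
        if (decl < 0) return null;
        int start = jsCode.indexOf('{', decl);
        if (start < 0) return null;

        int depth = 0;
        boolean inString = false;
        char quote = 0;
        for (int i = start; i < jsCode.length(); i++) {
            char c = jsCode.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++; // skip the escaped character
                } else if (c == quote) {
                    inString = false;
                }
            } else if (c == '"' || c == '\'') {
                inString = true;
                quote = c;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) return jsCode.substring(start, i + 1);
            }
        }
        return null; // unbalanced braces
    }

    public static void main(String[] args) {
        // Made-up sample input; the real script content comes from scriptElement.html()
        String jsCode = "var chartData = {\"rows\":[{\"v\":\"a}b\"}]}; draw(chartData);";
        System.out.println(extractObjectLiteral(jsCode, "chartData"));
    }
}
```

In the sample code above, this would be called as `ChartDataExtractor.extractObjectLiteral(jsCode, "chartData")`.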
References:
How to extract the text of a <script> element with Jsoup?

Try selecting the same element by its id; you should get the complete tag.

Related

Getting Data from multiple a tags in HTML

I am scraping a medical website where I need to extract information about a drug header by header, e.g. Precautions, Contraindications, Dosage, Uses, etc. The HTML looks like the sample below. If I just extract with the selector p.drug-content, I get the content under all the headers as one big paragraph. How do I get the content per header, so that the dosage paragraph comes under Dosage, the precautions paragraph under Precautions, and so on?
<a name="Warning"></a>
<div class="report-content drug-widget">
<div class="drug-header"><h2 style="color:#000000!important;">What are the warnings and precautions for Abacavir? </h2></div>
<p class="drug-content">
• Caution is advised when used in patients with history of depression or at risk for heart disease<br>• Avoid use with alcohol.<br>• Take along with other anti-HIV drugs and not alone, to prevent resistance.<br>• Continue other precautions to prevent spread of HIV infection.</p></div>
<a name="Prescription"></a>
<div class="report-content drug-widget">
<div class="drug-header"><h2 style="color:#000000!important;">Why is Abacavir Prescribed? (Indications) </h2></div>
<p class="drug-content">Abacavir is an antiviral drug that is effective against the HIV-1 virus. It acts on an enzyme of the virus called reverse transcriptase, which plays an important role in its multiplication. Though abacavir reduces viral load and may slow the progression of the disease, it does not cure the HIV infection. </p></div>
<a name="Dosage"></a>
<div class="report-content drug-widget">
<div class="drug-header"><h2 style="color:#000000!important;">What is the dosage of Abacavir?</h2></div>
<p class="drug-content"> Treatment of HIV-1/AIDS along with other medications. Dose in adults is 600 mg daily, as a single dose or divided into two doses.
</p></div>
Here is my code:
private static void ScrapingDrugInfo() throws IOException{
Connection.Response response = null;
Document doc = null;
List<SideEffectsObject> sideEffectsList = new ArrayList<>();
int i=0;
String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
for (String keyword : keywords){
final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;
response = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.execute();
doc = response.parse();
Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();
Elements links = tds.select("li[class=list-item]");
for (Element link : links){
final String newURL = "https://www.medindia.net/doctors/drug_information/".concat(link.select("a").attr("href")) ;
response = Jsoup.connect(newURL)
.userAgent("Mozilla/5.0")
.execute();
doc = response.parse();
Elements classification = doc.select("div.clear.b");
System.out.println("Classification::"+classification.text());
Elements drugBrands = doc.select("div.drug-content");
Elements drugBrandsIndian = drugBrands.select("div.links");
System.out.println("Drug Brand Links Indian::"+ drugBrandsIndian.select("a[href]"));
System.out.println("Drug Brand Names Indian::"+ drugBrandsIndian.text());
System.out.println("Drug Brand Names International::"+doc.select("div.drug-content.h3").text());
Elements prescritpionText = doc.select("a[name=Prescription]");
Elements prescriptionData = prescritpionText.select("p.drug-content");
System.out.println("Prescription Data::"+ prescriptionData.text());
Elements contraindications = doc.select("a[name=Contraindications]");
Elements contraindicationsText = contraindications.select("p[class=drug-content]");
System.out.println("Contrainidications Text::" + contraindicationsText.text());
Elements dosage = doc.select("a[name=Dosage]");
Elements dosageText = dosage.select("p[class=drug-content]");
System.out.println("Dosage Text::" + dosageText.text());
}
}
}
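If I understand the question correctly, you want to pair the value of each a tag's name attribute with the p content of the div that follows it. Note that the a tags are empty, so selecting p.drug-content inside them finds nothing; the content lives in the next sibling div. You should be able to get it with the following code: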
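Elements aTags = doc.select("a[name]");
for (Element header : aTags) {
    // The content div is the next sibling element of the anchor
    Element sibling = header.nextElementSibling();
    if (sibling == null) continue; // anchor without a following element
    // Get the p content of that sibling div
    Element pTag = sibling.select("p.drug-content").first();
    if (pTag == null) continue; // following element has no drug content
    System.out.println(header.attr("name"));
    System.out.println(pTag.text());
}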

Get class name Jsoup

I am trying to parse some HTML for an Android app, but I can't get the value of the data-id attribute.
Here's the html code
<div class="popup event-popup Predavanja" style="display: none;" data-id="246274" data-position="bottom" >
How can I parse the 246274 value?
If you have the Element object of the div tag, then this code will work:
String attr = element.attr("data-id"); // get the value of the 'data-id' attribute
int dataID = Integer.parseInt(attr); // convert it to an int
Optionally, if you want to check first if the attribute even exists, use this:
if (element.hasAttr("data-id")) // etc.
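The existence check and the conversion can be folded into one small helper that also copes with non-numeric values. The following is a sketch under assumptions: the helper name `parseDataId` is made up for illustration, it takes the raw string returned by `element.attr("data-id")` (jsoup returns an empty string when the attribute is absent), and it relies on `java.util.OptionalInt`, which on Android needs API level 24 or later.

```java
import java.util.OptionalInt;

class DataIdParser {

    /** Converts a raw attribute value to an int, or returns empty if it is missing or not numeric. */
    static OptionalInt parseDataId(String rawValue) {
        if (rawValue == null || rawValue.trim().isEmpty()) {
            return OptionalInt.empty(); // attribute absent or blank
        }
        try {
            return OptionalInt.of(Integer.parseInt(rawValue.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // attribute present but not an integer
        }
    }

    public static void main(String[] args) {
        // In real code the argument would be element.attr("data-id")
        OptionalInt id = parseDataId("246274");
        if (id.isPresent()) {
            System.out.println(id.getAsInt());
        }
    }
}
```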
I think you can do it like this:
Document doc = Jsoup.connect("Url").get(); // replace "Url" with the page address
Element divElement = doc.select("div.popup.event-popup.Predavanja").first(); // div with all three class names
String dataId = divElement.attr("data-id");
See the selector syntax documentation: https://jsoup.org/cookbook/extracting-data/selector-syntax

Parsing a YouTube thumbnail in an iframe with Jsoup

I would like to display the default thumbnail image of this YouTube URL in my Android app:
<iframe width="560" height="315" src="https://www.youtube.com/embed/FXx_gbdIUKg" frameborder="0" allowfullscreen=""></iframe>
This is my method for doing so:
static String parseThumbnail(String youTubeURL){
org.jsoup.nodes.Document document = Jsoup.parse(youTubeURL);
Elements youtubeElements = document.select("FXx_gbdIUKg");
org.jsoup.nodes.Document iframeDoc = Jsoup.parse(youtubeElements.get(0).data());
Elements iframeElements = iframeDoc.select("iframe");
return iframeElements.attr("http://img.youtube.com/vi/"+youtubeElements+"/default.jpg");
The iframe is within the "content:encoded" node, so I'm calling this method here:
String itemYouTubeImage = null;
if (XML_TAG_CONTENT_ENCODED.equalsIgnoreCase(tag)) {
String contentEncoded = tagNode.getTextContent();
itemYouTubeImage = parseThumbnail(contentEncoded);
itemImageURL = parseImageFromHTML(contentEncoded);
itemContentEncodedText = parseTextFromHTML(contentEncoded);
How do I do this properly?
One problem I have is that the compiler tells me that the value parseThumbnail(contentEncoded) assigned to itemYouTubeImage is never used.
If you want just the default thumbnail, it is provided in the <head> of the YouTube HTML document. It is not encoded.
<link itemprop="thumbnailUrl"
href="https://i.ytimg.com/vi/2qhzsn3pZgk/maxresdefault.jpg">
To select on the attribute value and get the absolute URL:
String youtubeUrl = "https://www.youtube.com/watch?v=9wpqE8OSWrU";
Document doc = Jsoup.connect(youtubeUrl).get();
String thumbnailUrl = doc
.select("link[itemprop=thumbnailUrl]")
.first()
.absUrl("href");
System.out.println(thumbnailUrl);
Output
https://i.ytimg.com/vi/9wpqE8OSWrU/maxresdefault.jpg
Read more in the Jsoup cookbook.
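If fetching the YouTube page is not desirable, another option is to take the video ID straight from the iframe's src and build the thumbnail address from it, which is what the method in the question appears to attempt. The sketch below assumes the embed URL format shown in the question and the img.youtube.com address pattern used there; the class name, the method name, and the ID character set in the regular expression are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YouTubeThumbnail {

    // Captures the ID segment of an embed URL such as https://www.youtube.com/embed/FXx_gbdIUKg
    private static final Pattern EMBED_ID =
            Pattern.compile("youtube\\.com/embed/([A-Za-z0-9_-]+)");

    /** Returns the default thumbnail URL for the first embedded video in the HTML, or null if none is found. */
    static String thumbnailUrl(String html) {
        Matcher m = EMBED_ID.matcher(html);
        if (!m.find()) {
            return null; // no YouTube embed in this content
        }
        return "http://img.youtube.com/vi/" + m.group(1) + "/default.jpg";
    }

    public static void main(String[] args) {
        String contentEncoded = "<iframe width=\"560\" height=\"315\" "
                + "src=\"https://www.youtube.com/embed/FXx_gbdIUKg\" "
                + "frameborder=\"0\" allowfullscreen=\"\"></iframe>";
        System.out.println(thumbnailUrl(contentEncoded));
    }
}
```

This avoids a network request per item, at the cost of depending on the embed URL format staying the same.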

How to parse HTML Heading

I am parsing this HTML:
<div id="articleHeader">
<h1 class="headline">Assassin's Creed Revelations: The Three Heroes</h1>
<h2 class="subheadline">Exclusive videos and art spanning three eras of assassins.</h2>
<h2 class="publish-date"><script>showUSloc=(checkLocale('uk')||checkLocale('au'));document.writeln(showUSloc ? '<strong>US, </strong>' : '');</script>
<span class="us_details">September 22, 2011</span>
What I want to do is parse the headline, subheadline, and publish date into separate Strings.
Just use the proper CSS selectors to grab them.
Document document = Jsoup.connect(url).get();
String headline = document.select("#articleHeader .headline").text();
String subheadline = document.select("#articleHeader .subheadline").text();
String us_details = document.select("#articleHeader .us_details").text();
// ...
Or a tad more efficient:
Document document = Jsoup.connect(url).get();
Element articleHeader = document.select("#articleHeader").first();
String headline = articleHeader.select(".headline").text();
String subheadline = articleHeader.select(".subheadline").text();
String us_details = articleHeader.select(".us_details").text();
// ...
Android has a SAX parser built in. You can use other standard XML parsers as well.
But I think that if your HTML is simple enough, you could use a regular expression to extract the strings.

jsoup tag extraction problem

test: example test1:example1
Elements size = doc.select("div:contains(test:)");
How can I extract the values example and example1 from this HTML using jsoup?
Since this HTML is not semantic enough for your purpose (a <br> cannot have children and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction or tokenizing.
The best you can do is get the HTML content of the <div> using Jsoup and then process it further with the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String[] parts = div.html().split("<br\\s*/?>"); // Depending on the version, Jsoup emits <br> or <br />; the regex matches both.
for (String part : parts) {
int colon = part.indexOf(':');
if (colon > -1) {
System.out.println(part.substring(colon + 1).trim());
}
}
This results in
example
example1
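If the labels are needed as well as the values, the same splitting idea can fill a map instead of printing. This is a sketch only: the class and method names are hypothetical, and it takes the inner HTML of the <div> as a plain string (what `div.html()` would return), so it can be tried without Jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BrSeparatedPairs {

    /** Splits "label: value" segments separated by <br> tags into an insertion-ordered map. */
    static Map<String, String> parse(String innerHtml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        // The regex accepts both <br> and <br /> as separators
        for (String part : innerHtml.split("<br\\s*/?>")) {
            int colon = part.indexOf(':');
            if (colon > -1) {
                String label = part.substring(0, colon).trim();
                String value = part.substring(colon + 1).trim();
                pairs.put(label, value);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, String> pairs = parse("<br>test: example<br>test1:example1");
        System.out.println(pairs); // prints {test=example, test1=example1}
    }
}
```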
If I were the HTML author, I would have used a definition list for this, e.g.:
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus easier to parse:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
Elements dds = document.select("#mydl dd"); // the <dd> elements hold the values
for (Element dd : dds) {
    System.out.println(dd.text());
}
