Unable to get links from html - jsoup - java

With the following code I am able to get the desired text from the website, but I am unable to get the link associated with the text. I have tried several combinations of methods; the most I get is the entire outer HTML, as shown below:
<li class="list-item">
<h4><a class="bold" href="abacavir.htm">Abacavir </a> </h4>
Abacavir is an antiviral drug that is effective against the HIV-1 virus.</li>
Here is the code:
public static void main(String[] args) throws Exception {
Map<String,String> drugLinks = new LinkedHashMap<String,String>();
final int OK = 200;
//String currentURL;
//int page = 1;
int status = OK;
Connection.Response response = null;
Document doc = null;
String[] keywords = {"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"};
//String keyword = "a";
for (String keyword : keywords){
final String url = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=" + keyword;
response = Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.execute();
status = response.statusCode();
doc = response.parse();
Element tds = doc.select("div.related-links.top-gray.col-list.clear-fix").first();
Elements links = tds.select("li[class=list-item]");
for (Element link : links){
System.out.println("generic::"+link.select("a[href]").text());
System.out.println("link::"+link.attr("abs:a"));
}
}
}
Output
generic::Abacavir
link::
generic::Abacavir Sulfate and Lamivudine
link::
generic::Abacavir Sulfate, Lamivudine and Zidovudine
link::
generic::Abaloparatide
link::
generic::Abarelix
link::
How do I get the absolute links from the given HTML?

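Your call link.attr("abs:a") looks for an attribute named "a" on the <li> element, which does not exist, so it returns an empty string. The href attribute belongs to the nested <a> element. To get the link from the element, you can use:
link.select("a").attr("href")
However, that will only give you the relative link (for example, abacavir.htm).
The full link will be:
"https://www.medindia.net/doctors/drug_information/" + link.select("a").attr("href")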
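Instead of concatenating strings by hand, the relative href can be resolved against the URL of the page it came from. The following is a minimal sketch using only java.net.URI from the standard library; the class name LinkResolver and the method resolve are hypothetical names chosen for illustration. jsoup's abs: attribute prefix, as in attr("abs:href"), performs this kind of resolution against the document's base URI.

```java
import java.net.URI;

final class LinkResolver {
    // Resolves a possibly relative href against the URL of the page it was found on.
    // An href that is already absolute is returned unchanged.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.medindia.net/doctors/drug_information/home.asp?alpha=a";
        // The query string and last path segment of the page URL are dropped during resolution.
        System.out.println(resolve(page, "abacavir.htm"));
    }
}
```

Note that URI.create throws an IllegalArgumentException for hrefs containing illegal characters such as unescaped spaces, so real-world input may need cleaning first.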

Related

Scraping multiple pages with jsoup

I am trying to scrape the pagination links of GitHub repositories.
I have scraped them separately, but now I want to optimize this using a loop. Any idea how I can do it? Here is the code:
ComitUrl= "http://github.com/apple/turicreate/commits/master";
Document document2 = Jsoup.connect(ComitUrl ).get();
Element pagination = document2.select("div.pagination a").get(0);
String Url1 = pagination.attr("href");
System.out.println("pagination-link1 = " + Url1);
Document document3 = Jsoup.connect(Url1).get();
Element pagination2 = document3.select("div.pagination a").get(1);
String Url2 = pagination2.attr("href");
System.out.println("pagination-link2 = " + Url2);
Document document4 = Jsoup.connect(Url2).get();
Element check = document4.select("span.disabled").first();
if (check.text().equals("Older")) {
System.out.println("No pagination link more");
}
else { Element pagination3 = document4.select("div.pagination a").get(1);
String Url3 = pagination3.attr("href");
System.out.println("pagination-link3 = " + Url3);
}
Try something like the following:
public static void main(String[] args) throws IOException {
    String url = "http://github.com/apple/turicreate/commits/master";
    // get the first pagination link
    String link = Jsoup.connect(url).get().select("div.pagination a").get(0).attr("href");
    // counter used only to number the printed links
    int i = 1;
    System.out.println("pagination-link_" + i + "\t" + link);
    // fetch each next page once and continue while its pagination div has more than one link
    Elements pageLinks = Jsoup.connect(link).get().select("div.pagination a");
    while (pageLinks.size() > 1) {
        link = pageLinks.get(1).attr("href");
        System.out.println("pagination-link_" + (++i) + "\t" + link);
        pageLinks = Jsoup.connect(link).get().select("div.pagination a");
    }
}
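The looping idea can also be separated from the page fetching, which makes it easier to test and to cap the number of requests. The sketch below is an illustration under assumptions, not the answer's method: PaginationWalker, collectLinks, and the nextLink function are hypothetical names, and the in-memory map stands in for real pages. In real use, nextLink would wrap the jsoup call that reads the "Older" link from a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class PaginationWalker {
    // Follows "next page" links from startUrl until the lookup returns empty
    // or maxPages pages have been visited (a guard against endless loops).
    static List<String> collectLinks(String startUrl,
                                     Function<String, Optional<String>> nextLink,
                                     int maxPages) {
        List<String> visited = new ArrayList<>();
        Optional<String> current = Optional.of(startUrl);
        while (current.isPresent() && visited.size() < maxPages) {
            visited.add(current.get());
            current = nextLink.apply(current.get());
        }
        return visited;
    }

    public static void main(String[] args) {
        // Stand-in for real pages: each URL maps to the URL its "Older" link points to.
        Map<String, String> fakeSite = Map.of("page1", "page2", "page2", "page3");
        List<String> links = collectLinks("page1",
                url -> Optional.ofNullable(fakeSite.get(url)), 10);
        System.out.println(links);
    }
}
```

The maxPages limit is a design choice for safety: a site whose pagination links form a cycle would otherwise keep the loop running forever.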

Java jsoup link extracting

I am trying to extract the links within a given element in jsoup. Here is what I have done, but it's not working:
Document doc = Jsoup.connect(url).get();
Elements element = doc.select("section.row");
Element s = element.first();
Elements se = s.getElementsByTag("article");
for(Element link : se){
System.out.println("link :" + link.select("href"));
}
Here is the html:
What I am trying to do is get all the links within the article elements. I thought that I should first select the section with class="row" and then somehow derive the links from the article elements, but I could not make it work.
Try this:
Document doc = Jsoup.connect(url).get();
Elements section = doc.select("#main"); // select the section with id = main
Elements allArtTags = section.select("article"); // select all article tags in that section
for (Element artTag : allArtTags) {
    Elements atags = artTag.select("a"); // select all a tags in each article tag
    for (Element atag : atags) {
        System.out.println(atag.text()); // print the link text, or
        System.out.println(atag.attr("href")); // print the link
    }
}
I'm using this in one of my projects:
final Elements elements = doc.select("div.item_list_section.item_description");
You'll first have to select the elements you want to extract links from, then inspect each one:
private static ... inspectElement(Element e) {
try {
final String name = getAttr(e, "a[href]");
final String link = e.select("a").first().attr("href");
//final String price = getAttr(e, "span.item_price");
//final String category = getAttr(e, "span.item_category");
//final String spec = getAttr(e, "span.item_specs");
//final String datetime = e.select("time").attr("datetime");
...
}
catch (Exception ex) { return null; }
}
private static String getAttr(Element e, String what) {
try {
return e.select(what).first().text();
}
catch (Exception ex) { return ""; }
}

Jsoup and list of attachments

I have this HTML block:
<ul class="list_attachments"><li>
<img src='pdf.png' alt='pdf'/> File1</li><li>
<img src='pdf.png' alt='pdf'/> File2</li>
</ul>
I would like to extract every "a href" entry, in particular the site and the file name.
So I tried this:
String [] fileName = new String[2];
String [] url = new String[2];
int i=0;
attachments = document.select(".list_attachments");
for (Element attachment : attachments) {
String fileName[i] = attachment.text();
String url[i] = attachment.select("a").attr("href");
i++;
}
But the result is:
String fileName = "File1 File2";
String url = "www.site1.com";
The problem is that there is only one attachment element instead of two as I expected.
How to solve this? Thanks.
.select(".list_attachments") can select only <ul class="list_attachments"> so it returns Elements with only one <ul> element. I suspect that you wanted to select all <a ...> elements which exist inside .list_attachments and then take their text and href. In that case your code should look more like
Elements anchors = document.select(".list_attachments a");
for (Element anchor : anchors) {
    fileName[i] = anchor.text();
    url[i] = anchor.attr("href");
    i++;
}

Extracting content from html tags using java

I extracted data from an HTML page and then parsed the tags, which look like the snippet in my output. I tried different ways, such as extracting substrings, to extract only the title and href attributes, but it's not working. Can anyone help me? My code and a small snippet of my output are as follows.
my code
doc = Jsoup.connect("myurl").get();
Elements link = doc.select("a[href]");
String stringLink = null;
for (int i = 0; i < link.size(); i++)
{
stringLink = link.toString();
System.out.println(stringLink);
}
output
<a class="link" title="Waf Ad" href="https://www.facebook.com/waf.ad.54"
data- jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https:
//fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186729_100007938933785_
508764241_q.jpg" alt="Waf Ad" data-jsid="img" /></a>
<a class="link" title="Ana Ga" href="https://www.facebook.com/ata.ga.31392410"
data-jsid="anchor" target="_blank"><img class="_s0 _rw img" src="https://
fbcdn-profile-a.akamaihd.net/hprofile-ak-ash1/t5/186901_100002334679352_
162381693_q.jpg" alt="Ana Ga" data-jsid="img" /></a>
You can use the attr() method of Element class to extract the value of attributes.
For example:
String href = link.attr("href");
String title = link.attr("title");
See this page for more: Extract attributes, text, and HTML from elements
To get the page title, you can use
Document doc = Jsoup.connect("myurl").get();
String title = doc.title();
For getting the individual links from the different hrefs, you can use this
Elements links = doc.select("a[href]");
for(Element ele : links) {
System.out.println(ele.attr("href").toString());
}
The attr() method returns the value of the specified attribute on the given tag.
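import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Solution {
    public static void main(String[] args) {
        Scanner scan = new Scanner(System.in);
        // the first input line holds the number of lines to process
        int testCases = Integer.parseInt(scan.nextLine());
        while (testCases-- > 0) {
            String line = scan.nextLine();
            boolean matchFound = false;
            // matches <tag>content</tag>, where the closing tag must repeat the opening tag
            Pattern r = Pattern.compile("<(.+)>([^<]+)</\\1>");
            Matcher m = r.matcher(line);
            while (m.find()) {
                System.out.println(m.group(2));
                matchFound = true;
            }
            if (!matchFound) {
                System.out.println("None");
            }
        }
    }
}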
REGULAR EXPRESSION EXPLANATION:
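The pattern used in the code above can be read piece by piece. The sketch below repeats the same regular expression with each part annotated and wraps it in a small method so it can be tried on a string directly; the class name TagContentExtractor and the method extract are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagContentExtractor {
    // <(.+)>   opening tag; group 1 captures the tag name
    // ([^<]+)  group 2 captures the content, which must not contain '<'
    // </\1>    closing tag; the backreference \1 must repeat exactly what group 1 captured
    private static final Pattern TAG = Pattern.compile("<(.+)>([^<]+)</\\1>");

    // Returns the content of every well-formed <tag>content</tag> pair in the line.
    static List<String> extract(String line) {
        List<String> contents = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            contents.add(m.group(2));
        }
        return contents;
    }

    public static void main(String[] args) {
        System.out.println(extract("<h1>Hello</h1><p>World</p>"));
        System.out.println(extract("<a>mismatched</b>"));
    }
}
```

Because the closing tag must match the opening tag through the backreference, a mismatched pair such as <a>...</b> produces no match. This works for simple single-line input; for real HTML pages, a parser such as jsoup is usually the more robust choice.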

Jsoup selecting and replacing multiple <a> elements

So I am just trying out the Jsoup API and have a simple question. I have a string and would like to keep it intact, except that when it passes through my method, the elements that wrap the links should be removed. Right now I have:
public class jsTesting {
public static void main(String[] args) {
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
Elements select = Jsoup.parse(html).select("a");
String linkHref = select.attr("href");
System.out.println(linkHref);
}}
This returns the first URL unwrapped only. I would like all URLs unwrapped as well as the original string. Thanks in advance
EDIT: SOLUTION:
Thanks a lot for the answer. I edited it only slightly to get the results I wanted. Here is the full solution that I am using:
public class jsTesting {
public static void main(String[] args) {
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a[href]");
for (Element link : links) {
doc.select("a").unwrap();
}
System.out.println(doc.text());
}
}
Thanks again
Here's the corrected code:
public class jsTesting {
    public static void main(String[] args) {
        String html = "<p>An <a href='http://example.com/'><b>example</b></a> link and after that is a second link called <a href='http://example2.com/'><b>example2</b></a></p>";
        Elements links = Jsoup.parse(html).select("a[href]"); // a with href
        for (Element link : links) {
            // Do whatever you want here
            System.out.println("Link Attr : " + link.attr("abs:href"));
            System.out.println("Link Text : " + link.text());
        }
    }
}
