Full link extraction using Java

My goal is to always get the same string (which is the URI in my case) when reading the href attribute of a link. Example:
Suppose an HTML file contains many links such as
<a href="index.html"> where the full address is http://www.domainname.com/index.html
<a href="../index.html"> where the full address is http://www.domainname.com/dit/index.html
How can I get every link correctly, that is, the full link including the domain name? How can I do that in Java?
The input is HTML; from a chunk of HTML code I need to extract the correct links.

You can do this using a full-fledged HTML parser like Jsoup. It has a Node#absUrl() method which does exactly what you want.
package com.stackoverflow.q3394298;

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Test {

    public static void main(String... args) throws Exception {
        URL url = new URL("https://stackoverflow.com/questions/3394298/");
        // Jsoup.connect() expects a String, so convert the URL first.
        Document document = Jsoup.connect(url.toString()).get();

        Element link = document.select("a.question-hyperlink").first();
        System.out.println(link.attr("href"));
        System.out.println(link.absUrl("href"));
    }

}
which prints (correctly) the following for the title link of your current question:
/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java
Jsoup may have other advantages for your purpose as well.
Related questions:
What are the pros and cons of the leading HTML parsers in Java?
Update: if you want to select all links in the document, then do as follows:
Elements links = document.select("a");
for (Element link : links) {
    System.out.println(link.attr("href"));
    System.out.println(link.absUrl("href"));
}

Use the URL object:
URL url = new URL(URL context, String spec)
Here's an example:
import java.net.*;

public class Test {
    public static void main(String[] args) throws Exception {
        URL base = new URL("http://www.java.com/dit/index.html");
        URL url = new URL(base, "../hello.html");
        System.out.println(base);
        System.out.println(url);
    }
}
It will print:
http://www.java.com/dit/index.html
http://www.java.com/hello.html
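Applied to the hrefs from the original question, the same two-argument constructor resolves both the plain and the parent-relative path against their base URLs. A minimal sketch (the class name is made up; the base URLs are the ones given in the question):
import java.net.URL;

public class ResolveLinks {
    public static void main(String[] args) throws Exception {
        // "index.html" resolved against the document's own base URL
        URL base1 = new URL("http://www.domainname.com/index.html");
        System.out.println(new URL(base1, "index.html"));    // http://www.domainname.com/index.html

        // "../index.html" resolved against a page one directory deeper
        URL base2 = new URL("http://www.domainname.com/dit/index.html");
        System.out.println(new URL(base2, "../index.html")); // http://www.domainname.com/index.html
    }
}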

Related

Why "http://www.stackoverflow.com" is not getting parsed but "http://www.javatpoint.com/java-tutorial" is getting parsed

I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks of a particular web page. When I use the Stack Overflow link, I am unable to get the hyperlinks on that page, but when I change it to javatpoint it works.
Can someone explain why?
Here is the code.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class Repo {

    public static void main(String s[]) throws IOException {
        try {
            Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
            // Document doc = Jsoup.connect("http://www.stackoverflow.com").get();
            System.out.println("doc");

            // Elements link = (Elements) doc.select("span[class]");
            // Elements link = doc.select("span").first();
            // Elements link = (Elements) doc.select("span");
            Elements link = (Elements) doc.select("a[href]");
            for (Element el : link) {
                // System.out.print("-");
                // System.out.println(el.attr("class"));
                String str = el.attr("href");
                System.out.println(str);
            }
        } catch (Exception e) {
        }
    }
}
Many websites require valid HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
        .connect("http://www.stackoverflow.com")
        .userAgent("Mozilla/5.0")
        .get();
Side note:
You should never catch exceptions and then silently ignore the possible failure case. At least do some logging there; otherwise your programs will be very hard to debug.
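Putting the User-Agent header and the logging advice together, here is a minimal sketch of the asker's loop (the class name is made up; the URL and selector come from the question):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLister {

    public static void main(String[] args) {
        try {
            // Send a User-Agent header so sites like Stack Overflow accept the request.
            Document doc = Jsoup
                    .connect("http://www.stackoverflow.com")
                    .userAgent("Mozilla/5.0")
                    .get();

            for (Element el : doc.select("a[href]")) {
                System.out.println(el.attr("href"));
            }
        } catch (Exception e) {
            // Don't swallow the failure silently; at least log it.
            e.printStackTrace();
        }
    }
}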

"Exception in thread "main" java.Lang.NullPointerException" Error

I am trying to run a web scraper in Eclipse that, using Jsoup, can take the names of the professors on this page: yu.edu/faculty and print them out. This is my code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class YUscraper {
    public static void main(String[] args) throws IOException {
        // Fetch & parse the HTML
        String url = "http://yu.edu/faculty/";
        Document document = Jsoup.connect(url).get();

        // Extract data
        Element content = document.getElementById("mainlist");
        Elements names = content.getElementsByTag("a");

        // Output data
        for (Element name : names) {
            System.out.println("Name: " + name.text());
        }
    }
}
I am getting this error:
Exception in thread "main" java.lang.NullPointerException
at YUscraper.main(YUscraper.java:18)
I am relatively new to this so pardon me if I am missing something really evident. I used many examples I have seen to get to this point, but I still don't understand what throws IOException is for and what it means that an exception was found. Please help, thanks!
Line 18 is
Elements names = content.getElementsByTag("a");
It seems there is no element with id "mainlist" in the HTML retrieved from http://yu.edu/faculty/.
It looks like you meant to access main-nav instead of mainlist.
In the line Element content = document.getElementById("mainlist");
content is returned as null, so calling getElementsByTag on it throws the NullPointerException. It looks like the HTML doesn't have an element with id 'mainlist'.
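Since getElementById() returns null when no element has that id, a defensive version of the scraper can check before dereferencing. A minimal sketch (the class name is made up, and whether mainlist exists on the page is exactly what is in question):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class YUscraperSafe {

    public static void main(String[] args) throws Exception {
        Document document = Jsoup.connect("http://yu.edu/faculty/").get();

        // getElementById() returns null when the id is not present on the page.
        Element content = document.getElementById("mainlist");
        if (content == null) {
            System.err.println("No element with id 'mainlist'; inspect the page to find the right id.");
            return;
        }
        for (Element name : content.getElementsByTag("a")) {
            System.out.println("Name: " + name.text());
        }
    }
}
Alternatively, document.select("#mainlist a") never returns null; it simply returns an empty list when nothing matches.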

How to get dynamic contents of any web page in DOM tree using JSOUP in Java

In my project, I parse an HTML page and then use the DOM tree for different operations, such as comparing the templates of two URLs.
For that, I am using Jsoup.
But it is not able to load dynamic content into the DOM tree.
Can you tell me how I can load dynamic content using Jsoup in Java, or suggest any other method for doing the same?
EDIT NO. 1
As the given link shows, it can work using PhantomJS and Zombie.js in Java. Can you tell me how I can do this?
EDIT NO. 2
I first tried to get the dynamic page by using Selenium, and the code is as follows:
public static void main(String[] args) throws IOException {
    // Selenium
    WebDriver driver = new FirefoxDriver();
    driver.get("ANY URL HERE");
    String html_content = driver.getPageSource();
    driver.get("ANOTHER URL HERE");
    String html_content1 = driver.getPageSource();
    driver.close();

    // Jsoup builds the DOM here by parsing the HTML content
    Document doc1 = Jsoup.parse(html_content);
    Document doc2 = Jsoup.parse(html_content1);

    // OPERATIONS USING DOM TREE
}
But this takes a lot of time, even after optimizing. Now, as per your instructions, I moved to HtmlUnit.
But I am not able to write code that gets the dynamic page source into a String, which I can then use for further parsing with Jsoup. Please help me write that code using HtmlUnit.
Code using HtmlUnit:
package XXX.YYY.ZZZ.Template_Matching;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Assert;
import org.junit.Test;

/**
 *
 * @author jhamb
 */
public class HtmlUnit {

    @Test
    public void homePage() throws Exception {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://www.jabong.com/Yepme-3-4Th-Sleeve-Printed-Blue-Top-Mksp-191481.html");
        Document ht = page.getOwnerDocument();
        System.out.println(ht);
        webClient.closeAllWindows();
    }

    public static void main(String[] args) throws Exception {
        HtmlUnit htmlUnit = new HtmlUnit();
        htmlUnit.homePage();
    }
}
I'm afraid Jsoup won't work in this case.
Try using HtmlUnit.
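For the asker's Edit No. 2, here is a minimal sketch of the usual HtmlUnit-plus-Jsoup combination: let HtmlUnit load the page and run its JavaScript, take the rendered markup as a String via asXml(), and hand that String to Jsoup. The URL is the one from the question; the class name is made up, and the exact WebClient cleanup call may differ between HtmlUnit versions.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DynamicPageToJsoup {

    public static void main(String[] args) throws Exception {
        final WebClient webClient = new WebClient();

        // HtmlUnit executes the page's JavaScript while loading it.
        HtmlPage page = webClient.getPage("http://www.jabong.com/Yepme-3-4Th-Sleeve-Printed-Blue-Top-Mksp-191481.html");
        // Give background scripts a moment to finish before reading the DOM.
        webClient.waitForBackgroundJavaScript(5000);

        // The rendered page as a String, ready for Jsoup.
        String html = page.asXml();
        webClient.closeAllWindows(); // newer HtmlUnit versions use webClient.close()

        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());
    }
}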

Extract https urls using jsoup

I have the following code that extracts urls from a given page using jsoup.
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        String url = "http://shopping.yahoo.com";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.getElementsByTag("a");

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s> (%s)", link.absUrl("href") /*link.attr("href")*/, trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
What I'm trying to do is build a crawler that extracts only https sites. I give the crawler a seed link to start with; it should then extract all https links, take each of the extracted links, and do the same with them until reaching a certain number of collected URLs.
My question: the above code can extract all links in a given page. I need to extract only links that start with https://. What do I need to do to achieve this?
You can use Jsoup's selectors. They are pretty powerful.
doc.select("a[href*=https]"); // selects links whose href contains "https"
doc.select("a[href^=www]");   // selects links whose href starts with "www"
doc.select("a[href$=.com]");  // selects links whose href ends with ".com"
Experiment with them and you will find the right one; for hrefs that start with https:// the prefix form [href^=...] is the one you are looking for.
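To make that concrete, here is a minimal sketch of an https-only variant of the ListLinks example above, using the a[href^=https] selector (the class and helper names are made up for illustration):
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HttpsLinks {

    // Returns the <a> elements whose href starts with "https" on the given page.
    static Elements httpsLinks(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        // [href^=https] matches hrefs that literally begin with "https".
        return doc.select("a[href^=https]");
    }

    public static void main(String[] args) throws IOException {
        for (Element link : httpsLinks("http://shopping.yahoo.com")) {
            System.out.println(link.absUrl("href"));
        }
    }
}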

get links in a web site

How can I get the links in a web page without loading it? (Basically, what I want is this: a user enters a URL and I want to list all the available links inside that URL.) Can you please tell me a way to achieve this?
Here is example Java code, using the HTML parser built into Swing:
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Main {
    public static void main(String args[]) throws Exception {
        URL url = new URL(args[0]);
        Reader reader = new InputStreamReader((InputStream) url.getContent());
        System.out.println("<HTML><HEAD><TITLE>Links for " + args[0] + "</TITLE>");
        System.out.println("<BASE HREF=\"" + args[0] + "\"></HEAD>");
        System.out.println("<BODY>");
        new ParserDelegator().parse(reader, new LinkPage(), false);
        System.out.println("</BODY></HTML>");
    }
}

class LinkPage extends HTMLEditorKit.ParserCallback {
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            System.out.println("<A HREF=\"" + a.getAttribute(HTML.Attribute.HREF) + "\">"
                    + a.getAttribute(HTML.Attribute.HREF) + "</A><BR>");
        }
    }
}
You'll have to load the page on your server and then find the links, preferably by loading the document in an HTML/XML parser and traversing that DOM. The server could then send the links back to the client.
You can't do it on the client because the browser won't let your JavaScript code look at the contents of a page from a different domain.
If you want the content of a page, you'll have to load it. But what you can do is load it in memory and parse it to get all the <a> tags and their content.
You'll be able to parse this XML with tools like JDOM or SAX if you're working with Java (as your tag says), or with simple DOM tools in JavaScript.
Resources:
Parse XML with javascript
On the same topic:
get all the href attributes of a web site (javascript)
Just open a URLConnection, get the page, and parse it.
public void extract_link(String site) {
    try {
        // extractLinks(site) is assumed to fetch the page and return its links.
        List<String> links = extractLinks(site);
        for (String link : links) {
            System.out.println(link);
        }
    } catch (Exception e) {
        System.out.println(e);
    }
}
This is a simple function to view all links on a page.
If you want to follow the links inside those links, just call it recursively (but make sure you set a limit according to your needs); a sketch of the missing extractLinks helper follows below.
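The snippet above leaves extractLinks undefined. Here is a minimal sketch of one possible implementation using Jsoup, as elsewhere on this page (the class name and the choice of Jsoup are assumptions, not part of the original answer):
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Fetches the page and returns the absolute URL of every <a href=...> on it.
    public static List<String> extractLinks(String site) throws Exception {
        List<String> links = new ArrayList<>();
        Document doc = Jsoup.connect(site).get();
        for (Element a : doc.select("a[href]")) {
            links.add(a.absUrl("href"));
        }
        return links;
    }
}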
