get links in a web site - java

How can I get the links in a web page without loading it? Basically, a user enters a URL and I want to load all the links available at that URL. Can you please tell me a way to achieve this?

Here is example Java code, specifically:
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class Main {
    public static void main(String[] args) throws Exception {
        URL url = new URL(args[0]);
        // Fetch the page and print an HTML document that lists every link found in it.
        try (Reader reader = new InputStreamReader(url.openStream())) {
            System.out.println("<HTML><HEAD><TITLE>Links for " + args[0] + "</TITLE>");
            System.out.println("<BASE HREF=\"" + args[0] + "\"></HEAD>");
            System.out.println("<BODY>");
            // true = ignore charset declarations in the page; with false the parser
            // throws ChangedCharSetException when it meets a <meta> charset tag.
            new ParserDelegator().parse(reader, new LinkPage(), true);
            System.out.println("</BODY></HTML>");
        }
    }
}
class LinkPage extends HTMLEditorKit.ParserCallback {
    // Called by the parser for every opening tag; only <a> tags are of interest.
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        Object href = a.getAttribute(HTML.Attribute.HREF);
        if (t == HTML.Tag.A && href != null) { // skip anchors without an href
            System.out.println("<A HREF=\"" + href + "\">" + href + "</A><BR>");
        }
    }
}

You'll have to load the page on your server and then find the links, preferably by loading up the document in an HTML/XML parser and traversing that DOM. The server could then send the links back to the client.
You can't do it on the client because the browser won't let your JavaScript code read the contents of a page from a different domain.

If you want the content of a page, you'll have to load it. What you can do is load it into memory and parse it to get all the <a> tags and their content.
You can parse it with XML tools like JDOM or SAX if you're working with Java (as your tag says), provided the markup is well-formed, or with simple DOM tools in JavaScript.
Resources :
Parse XML with javascript
On the same topic :
get all the href attributes of a web site (javascript)

Just open a URLConnection, get the page, and parse it.

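// extractLinks(...) is not shown in this answer; it is assumed to return the URLs found on the page.
public void extract_link(String site) {
    try {
        List<String> links = extractLinks(site);
        for (String link : links) {
            System.out.println(link);
        }
    } catch (Exception e) {
        System.out.println(e);
    }
}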
This is a simple function to print all the links in a page.
If you also want the links inside the linked pages, just call it recursively (but make sure you set a depth limit that suits your needs).
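The snippet above calls extractLinks without showing it. The following is one possible sketch, not the answer author's implementation: it assumes the JDK's built-in Swing HTML parser (the same one used in the first answer) and resolves each href against the page URL. The class name LinkExtractor and the Reader-based overload are hypothetical, added here so the parsing step can be run without a network connection.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {

    // Hypothetical helper: downloads the page at `site` and returns its links.
    // Assumes the page is UTF-8, which is a simplification.
    public static List<String> extractLinks(String site) throws IOException {
        URL base = new URL(site);
        try (Reader reader = new InputStreamReader(base.openStream(), StandardCharsets.UTF_8)) {
            return extractLinks(reader, base);
        }
    }

    // Parses HTML from `html` and returns every <a href> as an absolute URL,
    // resolved against `base`, in document order.
    public static List<String> extractLinks(Reader html, URL base) throws IOException {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return; // e.g. <a name="..."> anchors
                }
                try {
                    links.add(new URL(base, href.toString()).toString());
                } catch (MalformedURLException e) {
                    // Unknown scheme or unparsable href: skip this link.
                }
            }
        };
        // true = ignore charset declarations inside the document
        new ParserDelegator().parse(html, callback, true);
        return links;
    }
}
```

For the recursive variant, a caller could pass a depth counter alongside the URL and stop when it reaches zero, keeping a Set of already visited URLs to avoid loops.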

Related

Create a String variable from a URL using JSoup and Regex in Java?

So I am trying to make a program that retrieves the IFrame tag from a website, opens the link and downloads the video. Currently, it retrieves the IFrame tag, but I can't figure out how to ignore the actual tags. I am pretty sure I can use the .split() feature, but I don't know how to create a regex code to only pull the data from inside of the quotes. I also tried using JSoup's .html, but it just printed a blank statement. Here is what I have (It mostly split correctly, except in the URL there is "id=..." which causes it to split again):
package com.trentmenard;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Main {
public static void main(String[] args) {
Document website;
try{
website = Jsoup.connect("https://swordartonlineepisode.com/sword-art-online-season-3-episode-1-english-dubbed-watch-online/").get();
System.out.println("Website Found! Title: " + website.title());
Element videoLink = website.select("iframe").first();
System.out.println("Found Video Link: " + videoLink);
videoLink.removeAttr("width");
videoLink.removeAttr("height");
videoLink.removeAttr("scrolling");
videoLink.removeAttr("allowfullscreen");
System.out.println("Modified: " + videoLink);
String link = videoLink.toString();
String[] stringArray = link.split("=");
for(String a : stringArray){
System.out.println(a);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
}
Output: https://i.stack.imgur.com/ZXTiV.png
Thanks in advance!
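Splitting the tag's string form on "=" breaks as soon as the URL itself contains "=", which is what the output shows. With Jsoup, the attribute value is available directly through videoLink.attr("src") (or videoLink.absUrl("src") for an absolute URL), so no string surgery is needed. If a regex is still wanted, the sketch below pulls out only the text inside the quotes of the src attribute; the class and method names are hypothetical, and it only assumes the tag is available as a string.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IframeSrc {

    // Matches src="..." or src='...'; the lookbehind avoids matching data-src.
    // Group 1 is the quote character, group 2 the value between the quotes.
    private static final Pattern SRC = Pattern.compile(
            "(?<![\\w-])src\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the value of the first quoted src attribute in the tag, if any.
    public static Optional<String> extractSrc(String tagHtml) {
        Matcher m = SRC.matcher(tagHtml);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String tag = "<iframe src=\"https://example.com/embed.php?id=abc123\" frameborder=\"0\"></iframe>";
        System.out.println(extractSrc(tag).orElse("(no src found)"));
    }
}
```

Because the pattern captures everything up to the closing quote, an "id=..." inside the URL stays part of the result.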

Why "http://www.stackoverflow.com" is not getting parsed but "http://www.javatpoint.com/java-tutorial" is getting parsed

I am trying to learn the basic methods of jsoup. I tried to get all the hyperlinks of a particular web page. When I use the Stack Overflow link, I am unable to get the hyperlinks on that page, but when I change it to javatpoint, it works.
Can someone explain why?
Here is the code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.*;
import org.jsoup.nodes.*;
import java.io.*;
import org.jsoup.nodes.Document;
class Repo {
// String html;
public static void main(String s[]) throws IOException {
try {
Document doc = Jsoup.connect("http://www.javatpoint.com/java-tutorial").get();
// Document doc=Jsoup.connect("http://www.stackoverflow.com").get();
System.out.println("doc");
// Elements link=(Elements)doc.select("span[class]");
// Elements link = doc.select("span").first();
// Elements link = (Elements)doc.select("span");
Elements link = (Elements) doc.select("a[href]");
for (Element el : link) {
// System.out.print("-");
// System.out.println(el.attr("class"));
String str = el.attr("href");
System.out.println(str);
}
} catch (Exception e) {
}
}
}
Many websites require HTTP requests to carry certain headers. A prominent example is the User-Agent header. Stack Overflow, for example, will work with this:
Document doc = Jsoup
.connect("http://www.stackoverflow.com")
.userAgent("Mozilla/5.0")
.get();
Side note:
You should never catch exceptions and then silently ignore the failure. At least do some logging there; otherwise your programs will be very hard to debug.
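To illustrate both points without Jsoup, here is a minimal sketch that uses the JDK's own java.net.http client (Java 11+) instead: it sets a User-Agent header and reports failures rather than swallowing them. The class name UserAgentFetch and the buildRequest/fetch helpers are hypothetical, and treating any status of 400 or above as an error is an assumption.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UserAgentFetch {

    // Builds a GET request that carries an explicit User-Agent header.
    public static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Sends the request and returns the body; fails loudly on HTTP errors.
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url, "Mozilla/5.0"), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 400) {
            throw new IOException("HTTP " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetch(args[0]).length() + " characters fetched");
        } catch (IOException | InterruptedException e) {
            // Report the failure instead of ignoring it.
            System.err.println("Fetch failed: " + e);
        }
    }
}
```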

How to get all the source code from a page with Jsoup - Java [duplicate]

One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's the element whose content I need: <div id='tags_list'></div>
I need to get this information in Java, preferably using Jsoup. The element is filled with the help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component. There are a number of discussions on SO regarding that kind of component, e.g. "Is there a way to embed a browser in Java?"
Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening :
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problem:
If you can find the Ajax calls that the JavaScript code makes to load the content, you might be able to use the URLs of these calls with Jsoup. To find them, use your browser's Developer Tools. But this is not guaranteed to work:
it might be that the url is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.35.0</version>
</dependency>
Simple example, loading from a file (source: https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit)
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
In fact there is a "way"! Maybe it is more of a workaround than a way. The code below checks both for a meta "REFRESH" attribute and for JavaScript redirects. If either of them exists, the RedirectedUrl variable is set, so you know your target. Then you can retrieve the target page and go on.
String RedirectedUrl = null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").toUpperCase().contains("REFRESH")) {
    // content looks like "0; url=http://..."; limit the split so a '=' inside the URL is kept
    RedirectedUrl = meta.attr("content").split("=", 2)[1];
} else if (page.toString().contains("window.location.href")) {
    for (Element script : page.select("script")) {
        String s = script.data().trim();
        if (s.startsWith("window.location.href")) {
            int start = s.indexOf("=");
            int end = s.indexOf(";");
            if (start > 0 && end > start) {
                // take the assigned value and strip the quotes around it
                s = s.substring(start + 1, end);
                RedirectedUrl = s.replace("'", "").replace("\"", "").trim();
                break;
            }
        }
    }
}
... now retrieve the redirected page again...
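The content attribute of a meta refresh tag usually has the form "<delay>; url=<target>", and the target may itself contain "=" characters (query strings), so splitting on every "=" can truncate it. Below is a small, hedged sketch of a more tolerant parser using only the JDK; the class name MetaRefresh and the method targetUrl are hypothetical, and the accepted syntax is an approximation of what browsers tolerate, not a full implementation of the HTML specification.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {

    // "<delay>; url=<target>", with optional whitespace and any letter case for "url".
    private static final Pattern REFRESH = Pattern.compile(
            "^\\s*[\\d.]*\\s*[;,]\\s*url\\s*=\\s*(.+?)\\s*$",
            Pattern.CASE_INSENSITIVE);

    // Returns the redirect target from a meta refresh content value, if it has one.
    public static Optional<String> targetUrl(String content) {
        Matcher m = REFRESH.matcher(content);
        if (!m.matches()) {
            return Optional.empty(); // e.g. "30": reload the same page, no target
        }
        String url = m.group(1);
        char first = url.charAt(0);
        // Strip a matching pair of quotes around the URL, if present.
        if ((first == '\'' || first == '"')
                && url.length() >= 2
                && url.charAt(url.length() - 1) == first) {
            url = url.substring(1, url.length() - 1);
        }
        return Optional.of(url);
    }

    public static void main(String[] args) {
        System.out.println(targetUrl("0; url=http://example.com/watch?id=42").orElse("(none)"));
    }
}
```

With Jsoup, the input would be the content attribute of the refresh meta element, and the result can be resolved against the page URL before fetching it.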
It is possible by combining JSoup with another framework that interprets the web page. In my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();

how to differentiate xml from html links in java

I have a list of links that contains links to both HTML and XML pages. How can I extract the XML links from the list in Java?
Thanks
You could use a list of common filename extensions to divine the type of data stored at a given URL, but that often won't be very reliable, particularly with Web 2.0 sites (just look at the URL of this SO question itself). In addition, a link to a PHP script (.php) or other dynamic content site could return either HTML or XML. Or it could return something else entirely, such as a JPG file.
There are a lot of simple heuristics you can use for detecting HTML vs. XML, simply by looking at the beginning of the file. For example, you could look for the <!DOCTYPE ...> declaration, check for the <?xml ...?> directive, and check to see if the file contains a root <html> tag. Of course, these should all be case-insensitive checks.
You can also try to identify the type of file based on its MIME type (for example, text/html or text/xml). Unfortunately, many servers return incorrect or invalid MIME types, so you often have to read the beginning of the file anyway to divine its content, as you can see in my first two inadequate versions of a getMimeType() method below. The third attempt worked better, but the third-party MimeMagic library still provided disappointing results. Nevertheless, you could use the additional heuristics that I mentioned earlier to either replace or improve the getMimeType() method.
package com.example.mimetype;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.FileNameMap;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.jmimemagic.Magic;
import net.sf.jmimemagic.MagicException;
import net.sf.jmimemagic.MagicMatchNotFoundException;
import net.sf.jmimemagic.MagicParseException;
public class MimeUtils {
// After calling this method, you can retrieve a list of URLs for each mimetype.
public static Map<String, List<String>> sortLinksByMimeType(List<String> links) {
Map<String, List<String>> mapMimeTypesToLinks = new HashMap<String, List<String>>();
for (String url : links) {
try {
String mimetype = getMimeType(url);
System.out.println(url + " has mimetype " + mimetype);
// If this mimetype hasn't already been initialized, initialize it.
if (! mapMimeTypesToLinks.containsKey(mimetype)) {
mapMimeTypesToLinks.put(mimetype, new ArrayList<String>());
}
List<String> lst = mapMimeTypesToLinks.get(mimetype);
lst.add(url);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return mapMimeTypesToLinks;
}
public static String getMimeType(String url) throws MalformedURLException, IOException, MagicParseException, MagicMatchNotFoundException, MagicException {
// first attempt at determining MIME type--returned null for all URLs that I tried
// FileNameMap filenameMap = URLConnection.getFileNameMap();
// return filenameMap.getContentTypeFor(url);
// second attempt at determining MIME type--worked better, but still returned null for many URLs
// URLConnection c = new URL(url).openConnection();
// InputStream in = c.getInputStream();
// String mimetype = URLConnection.guessContentTypeFromStream(in);
// in.close();
// return mimetype;
URLConnection c = new URL(url).openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
byte[] content = new byte[100];
in.read(content);
in.close();
return Magic.getMagicMatch(content, false).getMimeType();
}
public static void main(String[] args) {
List<String> links = new ArrayList<String>();
links.add("http://stackoverflow.com/questions/10082568/how-to-differentiate-xml-from-html-links-in-java");
links.add("http://stackoverflow.com");
links.add("http://stackoverflow.com/feeds");
links.add("http://amazon.com");
links.add("http://google.com");
sortLinksByMimeType(links);
}
}
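The additional heuristics mentioned earlier (checking for an <?xml ...?> declaration, a <!DOCTYPE html> declaration, or a root <html> tag, all case-insensitively) could be sketched as follows. This is an illustration only: the class MarkupSniffer, its Kind enum, and the decision to classify XHTML (which has both an XML declaration and an <html> root) as HTML are assumptions, not part of the code above.

```java
import java.util.Locale;

public class MarkupSniffer {

    public enum Kind { HTML, XML, UNKNOWN }

    // Classifies a document from the first characters of its content.
    public static Kind sniff(String head) {
        // Drop a byte-order mark and leading whitespace, then compare case-insensitively.
        String s = head.replace("\uFEFF", "").stripLeading().toLowerCase(Locale.ROOT);
        if (s.contains("<!doctype html") || s.contains("<html")) {
            return Kind.HTML; // also catches XHTML that starts with an XML declaration
        }
        if (s.startsWith("<?xml")) {
            return Kind.XML;
        }
        return Kind.UNKNOWN; // images, plain text, XML without a declaration, ...
    }

    public static void main(String[] args) {
        System.out.println(sniff("<?xml version=\"1.0\"?><feed>"));
        System.out.println(sniff("<!DOCTYPE html><html>"));
    }
}
```

A caller would read the first few hundred bytes of the response, as getMimeType() already does, decode them to a String, and pass that to sniff(). Note that 100 bytes may be too few to reach the <html> tag when a long DOCTYPE or comment comes first.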
I'm not certain if your links are some sort of Link object, but as long as you can access the value as a string this should work I think.
List<String> xmlLinks = new ArrayList<String>();
for (String link : list) { // list is your List<String> of links
    // contains(".xml") also covers links that end with ".xml"
    if (link.contains(".xml")) {
        xmlLinks.add(link);
    }
}

Full Link Extraction using java

My goal is to always get the same string (the URI, in my case) when reading the href attribute of a link. Example:
Suppose an HTML file has many links like
<a href="index.html"> where the base URL is http://www.domainname.com/index.html
<a href="../index.html"> where the base URL is http://www.domainname.com/dit/index.html
How can I get all the links correctly, that is, the full link including the domain name?
How can I do that in Java?
The input is HTML; that is, the correct links need to be extracted from a bunch of HTML code.
You can do this using a full-fledged HTML parser like Jsoup. There's a Node#absUrl() method which does exactly what you want.
package com.stackoverflow.q3394298;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {
    public static void main(String... args) throws Exception {
        // Jsoup.connect() takes the URL as a String
        String url = "https://stackoverflow.com/questions/3394298/";
        Document document = Jsoup.connect(url).get();
        Element link = document.select("a.question-hyperlink").first();
        System.out.println(link.attr("href"));   // href as written in the page
        System.out.println(link.absUrl("href")); // href resolved against the page's base URI
    }
}
which prints (correctly) the following for the title link of your current question:
/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java
Jsoup may have other (as yet undiscovered) advantages for your purpose as well.
Related questions:
What are the pros and cons of the leading HTML parsers in Java?
Update: if you want to select all links in the document, then do as follows:
Elements links = document.select("a");
for (Element link : links) {
System.out.println(link.attr("href"));
System.out.println(link.absUrl("href"));
}
Use the URL object:
URL url = new URL(URL context, String spec)
Here's an example:
import java.net.*;
public class Test {
public static void main(String[] args) throws Exception {
URL base = new URL("http://www.java.com/dit/index.html");
URL url = new URL(base, "../hello.html");
System.out.println(base);
System.out.println(url);
}
}
It will print:
http://www.java.com/dit/index.html
http://www.java.com/hello.html
