What is the safest way to convert scraped URLs to real URLs? - java

I scrape a website and find these links on a page:
index.html
bla.html
/index.html
A.com/test.html
http://wwww.B.com/bla.html
If I know the current page is www.A.com/some/path, how can I convert these links into "real URLs" effectively? In each case, the URLs should translate to:
index.html => http://www.A.com/some/path/index.html
bla.html => http://www.A.com/some/path/bla.html
/index.html => http://www.A.com/index.html
A.com/test.html => http://www.A.com/test.html
http://wwww.B.com/bla.html => http://wwww.B.com/bla.html
What is the most effective way to convert these on-page links to their fully qualified URLs?

Use the java.net.URL class:
URL basePath = new URL("http://www.A.com/some/path/");
String relativePath = "index.html";
URL absolute = new URL(basePath, relativePath);
It will resolve the relative URL against the base path. If the relative URL is actually an absolute URL, it is returned unchanged. Note the trailing slash on the base: without it, the last segment ("path") is treated as a file name and dropped during resolution, giving http://www.A.com/some/index.html instead.
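A sketch covering the links from the question (the base URL is given a trailing slash here so that "path" is kept as a directory; without it, resolution drops the last segment):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ResolveLinks {
    public static void main(String[] args) throws MalformedURLException {
        // Trailing slash matters: without it, "path" is treated as a file
        // name and dropped during resolution.
        URL base = new URL("http://www.A.com/some/path/");

        System.out.println(new URL(base, "index.html"));
        // http://www.A.com/some/path/index.html
        System.out.println(new URL(base, "/index.html"));
        // http://www.A.com/index.html
        System.out.println(new URL(base, "http://www.B.com/bla.html"));
        // http://www.B.com/bla.html  (absolute URLs pass through unchanged)

        // Caveat: a scheme-less link like "A.com/test.html" is treated as a
        // relative *path*, not as a host name, so it does NOT become
        // http://www.A.com/test.html as the question expects.
        System.out.println(new URL(base, "A.com/test.html"));
        // http://www.A.com/some/path/A.com/test.html
    }
}
```

Handling the "A.com/test.html" case as a host would need extra heuristics of your own; java.net.URL alone won't do it.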

Brigham's answer is correct but incomplete.
The problem is that the page you scraped the URLs from could include a <base> element in the <head>. This base URL may be significantly different from the URL that you fetched the page from.
For example:
<!DOCTYPE html>
<html>
<head>
<base href="http://www.example.com/">
...
</head>
<body>
...
</body>
</html>
In the ... sections, any relative URLs will be resolved relative to the base URL rather than the original page URL.
This means that, if you want to resolve "scraped" URLs correctly in all cases, you also need to look for any <base> elements as you scrape.
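A minimal sketch of base-aware resolution using java.net.URL. The <base> element is found here with a deliberately simplistic regex; a real scraper should query its HTML parser's DOM instead:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BaseAwareResolver {
    // Simplistic: assumes href is the first, quoted attribute of <base>.
    private static final Pattern BASE_HREF =
            Pattern.compile("<base\\s+href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Resolve a scraped link against the page's <base> element if one is
     * present, otherwise against the URL the page was fetched from.
     */
    static URL resolve(String html, URL pageUrl, String link) throws MalformedURLException {
        URL base = pageUrl;
        Matcher m = BASE_HREF.matcher(html);
        if (m.find()) {
            // The base href may itself be relative, so resolve it too.
            base = new URL(pageUrl, m.group(1));
        }
        return new URL(base, link);
    }

    public static void main(String[] args) throws MalformedURLException {
        String html = "<html><head><base href=\"http://www.example.com/\"></head>"
                + "<body><a href=\"bla.html\">x</a></body></html>";
        URL pageUrl = new URL("http://www.A.com/some/path/page.html");
        System.out.println(resolve(html, pageUrl, "bla.html"));
        // http://www.example.com/bla.html  (the base element wins over the page URL)
    }
}
```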

URL removing for anchor tags

In my application, http://localhost:8080/TestApplication/subCategories/2 displays the subcategories of my table with id 2.
<a href="hello">Click Here</a>
When I click on the link rendered by the HTML above, my server is redirecting to http://localhost:8080/SecondOpinion/subCategories/hello
I want it to redirect to
http://localhost:8080/SecondOpinion/hello
How do I achieve that?
First of all, this has nothing to do with "anchor tags". An anchor tag is an HTML element of the form <a name="here">, and it defines a location within the HTML that another URL can link to.
What you have is an ordinary HTML link, and what you are seeing is standard HTML behavior for a relative link.
A relative link is resolved relative to the "parent" URL of the page containing the link.
If you want a link to go somewhere else, you can:
Use an absolute URL
Use a path in the relative link; e.g. <a href="../hello">Click Here</a>
Put a <base href="..."> element into your document's <head> section.
In your case, you seem to be¹ combining relative URLs with some unspecified server-side redirection, so you could either:
change as above, so that the URL that is sent to the server (before redirection) goes to a better place, or
change the redirection logic in your server.
I can't tell which would be more appropriate.
¹ I am inferring this because you said "my server is redirecting to". It is possible that you actually mean the browser is sending that URL to the server, and there is no redirection happening at all.
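The options above can be sketched with java.net.URI, using the paths from the question:

```java
import java.net.URI;

public class RelativeLinkDemo {
    public static void main(String[] args) {
        // The page containing the link:
        URI page = URI.create("http://localhost:8080/SecondOpinion/subCategories/2");

        // A bare relative href resolves against the page's "parent" path:
        System.out.println(page.resolve("hello"));
        // http://localhost:8080/SecondOpinion/subCategories/hello

        // Stepping up one level, or using an absolute path, reaches the
        // URL the asker wanted:
        System.out.println(page.resolve("../hello"));
        System.out.println(page.resolve("/SecondOpinion/hello"));
        // both: http://localhost:8080/SecondOpinion/hello
    }
}
```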

Why doesn't Jsoup.parse() return the full HTML document?

I am tinkering around with Jsoup and I wonder why Jsoup.parse(url) returns only part of the HTML. Here is my code:
System.out.println(Jsoup.parse("https://www.example.com"));
Here is the output
<html>
<head></head>
<body>
https://www.example.com
</body>
</html>
It is partial at best, and if you go to www.example.com you will see that the parser missed two <p> tags.
Now the docs say this:
public static Document parse(String html)
Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
Parameters: html - HTML to parse
Returns: sane HTML
It says it brings back a Document, but in practice it's a partial one. Also, what is "sane" HTML?
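The output is the clue here: Jsoup.parse(String) treats its argument as HTML markup, not as a URL to fetch, so the address string simply becomes body text inside a normalized document skeleton. That normalization is what the docs mean by "sane" HTML: whatever fragment you pass in is wrapped into a well-formed <html>/<head>/<body> tree. To actually download and parse the page, Jsoup.connect is the fetching counterpart. A sketch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class ParseVsConnect {
    public static void main(String[] args) throws IOException {
        // This parses the *string* as HTML; the URL is just body text,
        // which is exactly the output shown in the question.
        Document parsed = Jsoup.parse("https://www.example.com");
        System.out.println(parsed.body().text()); // https://www.example.com

        // This actually fetches the page over HTTP and parses the response.
        Document fetched = Jsoup.connect("https://www.example.com").get();
        System.out.println(fetched.select("p").size()); // number of <p> elements on the live page
    }
}
```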

Jsoup: How to get the returned Document's URL, if a redirection was involved between request and response

I have a Java web crawler. It opens this type of URL:
http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=
The final URL is different from this one, which I guess means that a redirect is involved. I can get and parse the returned Document, but is there any way to get the "final", "real" URL too?
That URL is not doing a redirect; it is returning a page which has this meta refresh header:
<meta http-equiv="refresh" content="0; url=https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158">
You can see your "final" url there.
You can parse the document for this tag with (for example) select("meta[http-equiv=refresh]")
And then parse the content attribute.
Summing up:
Document doc = Jsoup.connect("http://jobbank.dk/prefetch_net/s.aspx?c=j&u=/5l/GCyVEQ4dr07BQM6aDvW1I0UefK7VvjHbG5dHDz2P2tCsrbBFYiCBFyAvIdjVnWkl3nwjaUdTp8spu4B9B833lJobgVCKRfM MawPa4AoPK7JvRti4tFFFdmUbtr4LajxRjFH ERBWO7cx43GJ6ColMjDI40vayZSqQ Zl54dK4hqc/nj909Nvb 8Hm9aUmecabYb8Lecyigr3RH/msy NRXW8Le66u2OVepyXyLXHApptPYf2RK42PcqKEawanyjbWAnP8WlT9DaiO/adJ9mEEPIAadtEY/ocN3wSa4=").get();
Elements select = doc.select("meta[http-equiv=refresh]");
String content = select.first().attr("content");
String prefix = "url=";
String url = content.substring(content.indexOf(prefix) + prefix.length());
System.out.println(url);
This will give you the desired URL:
https://krb-xjobs.brassring.com/TGWebHost/jobdetails.aspx?PartnerId=30033&SiteID=5635&type=mail&jobId=722158
I hope it will help.

jsoup : Absolute path while working with files

I have a repository of HTML files. I want to process them using Jsoup, but when I try to get the absolute paths of all links, Jsoup gives me empty strings (""). Is there a way to set the baseUri as a file path?
Note: link.get(i).baseUri() + link.get(i).attr("href") is not sufficient for me, because I need to somehow recognize which links are relative and which are not.
The Jsoup documentation says:
There is a sister method parse(File in, String charsetName) which uses
the file's location as the baseUri. This is useful if you are working
on a filesystem-local site and the relative links it points to are
also on the filesystem.
But it doesn't work on my PC.
I'm "solving" the same issue with the following code. While I would prefer that Jsoup's functions worked on my local file system, I need something in the meantime. The workaround sends the file's location into the parser as the baseUri and then concatenates each relative path to that base. Unfortunately, that means I lose the nesting behavior of HTML's "../" that Jsoup normally handles with its built-in functions, and I can never be as certain about the results as if the built-in functions worked.
Luckily, I use this mainly for JUnit testing, so it should add only minor risk to my production code. The context is that I built a local "Internet" to test crawling offline. I create the Jsoup Document by passing a local HTML file to it in my JUnit test class:
// From my JUnit Test
String testFileName = "HTMLTest_RelativeReferences.html";
String testFilePath = getClass().getResource(testFileName).getPath();
String testFileBaseURI = testFilePath.replace(testFileName, "");
// ...
// Sends filePath and baseURI to testing class that creates JSoup Doc with:
siteDoc = Jsoup.parse(new File(testFilePath), "UTF-8", testFileBaseURI);
Now that I have created my Document with a baseUri, you and I would both expect relative paths to resolve against that baseUri into absolute paths. Since that fails, I run a simple test for empty-string abs:href values and concatenate my own URLs.
Elements links = siteDoc.select("a[href]"); // extract link collection
for (Element link : links) { // iterate through links
String linkString = link.attr("abs:href"); // ftr, neither this nor absUrl("href") works
if (linkString.isEmpty()) { // check if returned "" (i.e., the problem at hand)
URLs.add(siteDoc.baseUri() + link.attr("href")); // concatenate baseURI to relative ref
}
else { // for all the properly returned absolute refs
URLs.add(link.attr("abs:href"));
}
}
All my JUnit Tests continue to pass with both absolute and relative local references - good luck!
HTML Doc I used for reference with all 3 links representing other HTML files in the same folder:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>HTML Test using Relative References</title>
</head>
<body>
<a href="...">Link1</a>
<a href="...">Link2</a>
<a href="...">Link3</a>
</body>
</html>
Edit: A little digging into the Jsoup library leads me to believe our local file "URLs" will never work, because Jsoup handles actual URLs during its attr("abs:href") processing and will hit a MalformedURLException and return "", since we are actually using local file paths rather than true URLs. I consider this outside the scope of the above answer but thought I would mention the discovery.
You can use the absUrl() function on Jsoup elements:
String path = linkEl.absUrl("href");
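For the local-file case in the question, a sketch of why absUrl() returns "" and how a file: baseUri fixes it (the paths and file names here are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FileBaseUriDemo {
    public static void main(String[] args) {
        String html = "<a href='Link1.html'>Link1</a>";

        // A bare filesystem path as baseUri cannot be parsed as a URL,
        // so absUrl()/abs:href returns "".
        Document bad = Jsoup.parse(html, "/home/user/site/");
        System.out.println(bad.select("a").first().absUrl("href")); // ""

        // A well-formed file: URL works, and relative references
        // (including "../") resolve through the normal URL machinery.
        Document good = Jsoup.parse(html, "file:///home/user/site/");
        Element link = good.select("a").first();
        System.out.println(link.absUrl("href")); // file:///home/user/site/Link1.html
    }
}
```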

Image Extraction from webpage in java

I have just started working on a content extraction project. First I am trying to extract the image URLs in a webpage. In some cases, the src attribute of an img element has a relative URL, but I need the complete URL.
I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library that achieves this easily?
If you just need to get the complete URL from a relative one, the solution is simple in Java:
URL pageUrl = base_url_of_the_html_page;
String src = src_attribute_value; //relative or absolute URL
URL imgUrl = new URL(pageUrl, src);
The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document's header may specify a different base URL (though it is not used very frequently).
You can use Jsoup or any DOM parser to obtain the src attribute values and to find a possible base tag.
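A minimal sketch combining the two with Jsoup (the page URL and image names are placeholders): pass the page URL as the baseUri when parsing, and absUrl("src") then performs the same resolution as new URL(pageUrl, src) above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageUrlExtractor {
    public static void main(String[] args) {
        String html = "<img src='images/logo.png'>"
                + "<img src='http://cdn.example.com/a.png'>";

        // Supplying the page URL as baseUri lets Jsoup absolutize
        // relative src values; absolute ones pass through unchanged.
        Document doc = Jsoup.parse(html, "http://www.example.com/gallery/");

        for (Element img : doc.select("img[src]")) {
            System.out.println(img.absUrl("src"));
        }
        // http://www.example.com/gallery/images/logo.png
        // http://cdn.example.com/a.png
    }
}
```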
