unable to find absolute URL - java

I'm writing some code to find the absolute URLs on a single webpage:
http://explore.bfi.org.uk/4ce2b69ea7ef3
So far I get all the links on that page and print their absolute URLs.
Here is part of the code:
Elements hyperLinks = htmlDoc.select("a[href]");
for (Element link : hyperLinks)
{
    System.out.println(link.attr("abs:href"));
}
This prints out a lot of URLs just like the one above. However, it also seems to skip a few URLs, and the ones it skips are the ones I actually need.
This is one of the a[href] elements it's not turning into an absolute URL:
<div class="title">Royal Review<br /></div>
It prints this line if I just print "link", but when I use "abs:href" it prints a blank line.
I am new to Java and appreciate any feedback!

You shouldn't use "a[href]"; use "a" instead, following this example:
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
So in your case:
Elements hyperLinks = htmlDoc.select("a");
for (Element link : hyperLinks)
{
    System.out.println(link.attr("abs:href"));
}
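As background on why a link can print blank: jsoup's documentation for absUrl("href"), which attr("abs:href") is shorthand for, says it returns an empty string when the attribute is missing or cannot be made into an absolute URL, for example when there is no base URI to resolve a relative href against. Whether that is what happens on the page above is not something this sketch can confirm. The code below is not jsoup code; it only illustrates the same "absolute URL or empty string" rule with the standard java.net.URI class, and the names AbsUrlSketch and toAbsolute are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustration only: mimics the "absolute URL or empty string" rule with java.net.URI.
// AbsUrlSketch and toAbsolute are hypothetical names, not part of jsoup.
class AbsUrlSketch {

    // Resolves href against baseUri; returns "" if no absolute URL can be built.
    static String toAbsolute(String baseUri, String href) {
        try {
            URI resolved = new URI(baseUri).resolve(new URI(href));
            return resolved.isAbsolute() ? resolved.toString() : "";
        } catch (URISyntaxException e) {
            return ""; // malformed input: treat it like an unresolvable link
        }
    }

    public static void main(String[] args) {
        // Relative href with a base URI: resolves to an absolute URL.
        System.out.println(toAbsolute("http://jsoup.org", "/")); // http://jsoup.org/
        // Relative href without a base URI: nothing to resolve against.
        System.out.println(toAbsolute("", "/about").isEmpty()); // true
    }
}
```

If the blank output comes from a missing base URI, passing one when parsing (or loading the page with Jsoup.connect, as in the example above) is the usual remedy.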

Related

Modifying a regex for a better matching

Considering we have this regex to match URLs in a page:
(https?):\\/\\/(www\\.)?[a-z0-9\\.:].*?(?=\\s)
My question is how we can improve it so that it matches, for example:
http://stackoverflow.com/questions/ask
instead of:
http://stackoverflow.com/questions/ask">home</a></div><div>
In short, I want it to filter out any of ;:'".,<>?«»“”‘’ that usually come after URLs in a page's HTML code.

Since you are already using jsoup, I think the best way to get all the links is to use that library. This is not about losing originality; it is a matter of code safety, readability and maintainability.
Here is an example from the jsoup Cookbook of how you can collect all links from href and src attributes (judging from your regex, those are all you want to match):
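import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

List<String> urls = new ArrayList<String>(); // collected absolute URLs
Document doc = Jsoup.connect(url).get();
Elements anchors = doc.select("a[href]");
Elements media = doc.select("[src]");
Elements imports = doc.select("link[href]");
// The abs: prefix resolves each attribute value against the document's base URI
for (Element src : media) {
    urls.add(src.attr("abs:src"));
}
for (Element link : imports) {
    urls.add(link.attr("abs:href"));
}
for (Element link : anchors) {
    urls.add(link.attr("abs:href"));
}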
Note that this approach returns every link as an absolute URL, including those that are written as relative paths in the page.
If you do not need that and want to show off to your boss, try the following regex:
https?://(?:www\\.)?[a-z0-9.:][^<>"]+
Here, [^<>"]+ matches one or more characters other than <, > and ". This keeps the match from crossing the attribute boundary. You might want to add ' to the character class, too.
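A minimal sketch of the regex approach using only java.util.regex, run against the HTML fragment from the question. Only the pattern comes from the answer; the class and method names (UrlRegexSketch, extractUrls) are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the pattern itself comes from the answer above.
class UrlRegexSketch {

    // [^<>"]+ stops the match at the first <, > or ", i.e. at the attribute boundary.
    private static final Pattern URL =
            Pattern.compile("https?://(?:www\\.)?[a-z0-9.:][^<>\"]+");

    // Returns every substring of html that the pattern matches, in order.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"http://stackoverflow.com/questions/ask\">home</a></div><div>";
        System.out.println(extractUrls(html)); // [http://stackoverflow.com/questions/ask]
    }
}
```

Because the character class also accepts spaces, this pattern only suits URLs inside double-quoted attribute values; in running text it would keep matching past the end of the URL.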

JSoup Extracting absolute url of a href and a div tag data simultaneously

I want to extract two adjacent tags from a website. The first is an a tag, and I need its href as an absolute URL. The second is a div tag, and I need the data inside it.
I want the output to look like the following:
100 USD http:\www.somesite..............
200 usd http:\www.thesite.............
Why? Because later I will insert them into a table in a database.
I tried the following code, but I couldn't get the absolute URL, and I couldn't get rid of the tags, although I want to extract only the data (without tags).
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element link : doc.select("div.rightFloat.price,a[abs:href].more-details"))
{
    String absHref = url.attr("abs:href");
    String attr = link.absUrl("href");
    System.out.println(link);
}
If I try using
System.out.println(link.text())
in my code, I lose the hyperlink completely!
Any help, please?

I don't think that jsoup's comma-separated selector groups (i.e. the comma in the selector) guarantee any ordering in the output. At least I would not count on it, even if you find the two elements in the order you expect. Instead of using the comma selector, I would first loop over the outer containers that hold the adjacent elements you are interested in. Within each container you can then access the price and the link.
Something like this (note that this is off the top of my head and untested):
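Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
// Loop over each ad container so the price and link always come from the same ad
for (Element adDiv : doc.select("div.category-listing-normal-ad")) {
    Element priceDiv = adDiv.select("div.rightFloat.price").first();
    Element linkA = adDiv.select("a.more-details").first();
    // first() returns null when nothing matches, so skip incomplete ads
    if (priceDiv != null && linkA != null) {
        System.out.println(priceDiv.text() + " " + linkA.absUrl("href"));
    }
}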

collect only relevant links from url

What I need is to collect the relevant links from a URL. For example, from a page like http://beechplane.wordpress.com/ , I need to collect the links that point to the actual articles, i.e. links like http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/ , http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/ etc.
How can I get those links in Java? Is it possible using web crawlers?

I use the jsoup library for that.
Here is how to get all <a> tags from a document:
Elements a = doc.select("a");
for (Element el : a) {
    // process element
    String href = el.attr("href");
}
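To keep only the article links, one option is to filter the collected hrefs by their URL shape. The sketch below assumes that articles follow the /YYYY/MM/DD/slug/ permalink pattern seen in the two example URLs from the question; other sites will need a different pattern. The class and method names (ArticleLinkFilter, keepArticles) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter; the date-based pattern is an assumption drawn from the example URLs.
class ArticleLinkFilter {

    // scheme://host/YYYY/MM/DD/slug with an optional trailing slash
    private static final Pattern ARTICLE =
            Pattern.compile("https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/?#]+/?");

    // Keeps only the hrefs that match the article permalink pattern as a whole.
    static List<String> keepArticles(List<String> hrefs) {
        List<String> articles = new ArrayList<>();
        for (String href : hrefs) {
            if (ARTICLE.matcher(href).matches()) {
                articles.add(href);
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://beechplane.wordpress.com/",
                "http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/",
                "http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/");
        // Prints the two article URLs, one per line; the home page is dropped.
        keepArticles(hrefs).forEach(System.out::println);
    }
}
```

The pattern expects absolute URLs, so when feeding it from the loop above, el.attr("abs:href") is a better input than el.attr("href").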

using logical or,not,and in jsoup selectors

Using this
for (Element link : links) {
    String linkHref = link.attr("href and !#");
    String linkText = link.text();
}
I can get all the links that have "a href=..".
However, there are some with
href="#"
that I don't need in my string.
So I need to do something like
String linkHref = link.attr("href and !#")
i.e. I don't want to save a link that has "#" as its href.
Is that possible, or do I have to use a regular expression instead?
Please help.

After reading your question, it looks like you want to select all anchor tags that do not have '#' as their href. You can filter them out with Elements.not() or with the :not selector:
Elements links = doc.select("a[href]"); // All anchor tags with an href
links = links.not("[href='#']"); // Drop the links whose href is exactly "#"
jsoup's select also accepts a comma to combine selectors:
doc.select("[href], [src]"); // href **OR** src
For AND, just chain the conditions in a single CSS selector:
doc.select("a[href]:not([href='#'])");

How do I parse an HTML document with JSoup to get a list of links?

I am trying to parse http://www.craigslist.org/about/sites to build a set of text/links to load a program dynamically with this information. So far I have done this:
Document doc = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements elms = doc.select("div.colmask"); // gets 7 countries
Below this tag are the doc.select("div.state_delimiter,ul") tags I am trying to get. I set up my iterator, go into a while loop and call iterator.next().outerHtml();. I see all the tags for each country.
How can I step through each div.state_delimiter, pull its text, and then continue down until there is a </ul>, which marks the end of that state's individual county/city links and text?
I was playing around with this and can do it by putting outerHtml() into a String and then parsing the string manually, but I am sure there is an easier way. I have tried text() and also attr("div.state_delimiter"), but I think I am messing up the pattern/routine for doing this properly. Could someone show me how to get each div.state_delimiter as text, and then all the <li></li> under the <ul></ul> for each state? I'm looking to grab the http:// link and the HTML that goes along with it as easily as possible.

The <ul> containing the cities is the next sibling of the <div class="state_delimiter">. You can use Element#nextElementSibling() to get to it from that div. Here's a kickoff example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document document = Jsoup.connect("http://www.craigslist.org/about/sites").get();
Elements countries = document.select("div.colmask");
for (Element country : countries) {
    System.out.println("Country: " + country.select("h1.continent_header").text());
    Elements states = country.select("div.state_delimiter");
    for (Element state : states) {
        System.out.println("\tState: " + state.text());
        // The <ul> with the cities directly follows the state's div
        Elements cities = state.nextElementSibling().select("li");
        for (Element city : cities) {
            System.out.println("\t\tCity: " + city.text());
        }
    }
}
The doc.select("div.state_delimiter,ul") doesn't do what you want: it returns all <div class="state_delimiter"> and <ul> elements of the document. Parsing the HTML manually with string functions makes no sense if you already have an HTML parser at hand.
