Modifying a regex for better matching - Java

Suppose we have this regex to match URLs in a page:
(https?):\\/\\/(www\\.)?[a-z0-9\\.:].*?(?=\\s)
How can we improve it so that it matches, for example:
http://stackoverflow.com/questions/ask
instead of:
http://stackoverflow.com/questions/ask">home</a></div><div>
In short, I want it to exclude any of the characters ;:'".,<>?«»“”‘’ that usually follow a URL in a page's HTML code.

Since you are already using jsoup, I think the best way to get all the links is to use that library. This is not about losing originality; it is about the safety, readability and maintainability of your code.
Here is an example from the jsoup Cookbook showing how you can collect all links from href and src attributes (judging by your regex, those are the only ones you want to match):
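import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// url is assumed to hold the address of the page to scan
List<String> links = new ArrayList<String>();
Document doc = Jsoup.connect(url).get();
Elements anchors = doc.select("a[href]");
Elements media = doc.select("[src]");
Elements imports = doc.select("link[href]");
for (Element src : media) {
    links.add(src.attr("abs:src")); // "abs:" resolves the value against the page's base URI
}
for (Element link : imports) {
    links.add(link.attr("abs:href"));
}
for (Element anchor : anchors) {
    links.add(anchor.attr("abs:href"));
}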
Note that this approach returns every link as an absolute URL, even those that were relative in the page.
If you do not need that and want to show off to your boss, try the following regex:
https?://(?:www\\.)?[a-z0-9.:][^<>"]+
Here, [^<>"]+ matches one or more characters other than <, > and ". Thus, it stops the match from crossing the attribute boundary. You might want to add ' there, too.
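As a rough illustration of how this character class behaves, the sketch below runs the pattern over the HTML fragment from the question using java.util.regex. The class name UrlRegexDemo and the method extractUrls are hypothetical. The sketch also assumes a slightly wider exclusion set that stops at ' and at whitespace, which the regex above does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlRegexDemo {
    // Variation on the regex above: ' and whitespace are also excluded (an assumption).
    private static final Pattern URL = Pattern.compile(
            "https?://(?:www\\.)?[a-z0-9.:][^<>\"'\\s]+", Pattern.CASE_INSENSITIVE);

    // Returns every substring of html that the pattern matches, in order of appearance.
    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"http://stackoverflow.com/questions/ask\">home</a></div><div>";
        System.out.println(extractUrls(html)); // prints [http://stackoverflow.com/questions/ask]
    }
}
```

The match ends at the closing quote of the attribute, so the trailing ">home</a></div><div> from the question is no longer captured.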

Related

JSoup Extracting absolute url of a href and a div tag data simultaneously

I want to extract two adjacent tags from a website. The first is an a tag, whose href should be extracted as an absolute URL. The second is a div tag, from which I need the data inside it.
I want the output to look like the following:
100 USD http:\www.somesite..............
200 usd http:\www.thesite.............
Why? Because later I will insert them into a table in a database.
I tried the following code, but I couldn't get the absolute URL, and I couldn't get rid of the tags either; I want to extract only the data (without tags).
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element link : doc.select("div.rightFloat.price,a[abs:href].more-details"))
{
String absHref = url.attr("abs:href");
String attr = link.absUrl("href");
System.out.println(link);
}
If I try using
System.out.println(link.text())
in my code, I lose the hyperlink completely!
Any help please?
I don't think that jsoup's CSS selector combinators (i.e. the comma in the selector) guarantee any ordering in the output. At least I would not count on it, even if you happen to find the two elements in the order you expect. Instead of using the comma selector, I would first loop over the outer containers that hold the adjacent elements you are interested in. Within each container you can then access the price and the link.
Something like this. Note that this is off the top of my head and untested!
Document doc = Jsoup.connect("http://www.bezaat.com/ksa/jeddah/cars/all/1?so=77").get();
for (Element adDiv : doc.select("div.category-listing-normal-ad")) {
    Element priceDiv = adDiv.select("div.rightFloat.price").first();
    Element linkA = adDiv.select("a.more-details").first();
    System.out.println(priceDiv.text() + " " + linkA.absUrl("href"));
}

collect only relevant links from url

What I need is to collect the relevant links from a URL. For example, from a page like http://beechplane.wordpress.com/, I need to collect the links that point to the actual articles, i.e. links like http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/, http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/ etc.
How can I get those links in Java? Is it possible using web-crawlers?
I use the jsoup library for that.
How to get all <a> tags from a document:
Elements a = doc.select("a");
for (Element el : a) {
    // process element
    String href = el.attr("href");
}

unable to find absolute URL

I'm writing some code to find the absolute URLs on a single webpage:
http://explore.bfi.org.uk/4ce2b69ea7ef3
So far I get all the links on that page and print the absolute URLs.
Here is part of the code:
Elements hyperLinks = htmlDoc.select("a[href]");
for(Element link: hyperLinks)
{
System.out.println(link.attr("abs:href"));
}
This prints out a lot of URLs just like the one above. However, it also seems to skip a few URLs, and the ones it skips are the ones I actually need.
This is one of the a[href] elements that it is not turning into an absolute URL:
<div class="title">Royal Review<br /></div>
It prints this line if I just print "link", but when I use "abs:href" it prints a blank line.
I am new to Java and appreciate any feedback!
You shouldn't use "a[href]"; use "a" instead, following this example:
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
So in your case:
Elements hyperLinks = htmlDoc.select("a");
for (Element link : hyperLinks) {
    System.out.println(link.attr("abs:href"));
}
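For intuition about what converting a relative href into an absolute URL involves, here is a minimal sketch using only java.net.URI from the standard library. It is not jsoup's implementation. The class name AbsUrlSketch and the helper toAbsolute are hypothetical, and returning an empty string when no absolute URL can be built is an assumption modeled on the blank output described in the question.

```java
import java.net.URI;

class AbsUrlSketch {
    // Resolves href against baseUri; returns "" when no absolute URL can be built.
    static String toAbsolute(String baseUri, String href) {
        try {
            URI resolved = URI.create(baseUri).resolve(href);
            return resolved.isAbsolute() ? resolved.toString() : "";
        } catch (IllegalArgumentException e) {
            // URI.create and resolve(String) throw this for syntactically invalid input
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute("http://jsoup.org/", "/"));
        System.out.println(toAbsolute("http://example.com/docs/index.html", "page2.html"));
        System.out.println(toAbsolute("http://example.com/docs/index.html", "https://other.org/x"));
    }
}
```

In this sketch a relative href only becomes absolute if the base has a scheme, so a blank result is worth checking against the base URI the document was loaded with.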

Java Regular Expression: href without hash

I'm trying to build a sitemap by parsing the HTML bodies for hrefs that don't contain # (those with hashes are just sub-chapter links within some content pages).
My regexp now: <a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</a>
I guess I should use [^#] or !# to exclude the # from hrefs, but I could not solve it by trial and error or by googling. Thanks in advance for helping me out!
Solved it. I just added # to the [^\"] block as well. :D
<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>
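To see what the modified character class does in practice, the sketch below applies that regex to a small, made-up HTML string with java.util.regex. The class name HashlessHrefDemo, the method hashlessHrefs and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashlessHrefDemo {
    // The regex from the post; group 1 is the href value, group 2 the link text.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"#]*)\"[^>]*>(.*?)</a>");

    // Collects the href values of all anchors whose href contains no '#'.
    static List<String> hashlessHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#top\">Top</a> <a href=\"/about.html\">About</a> "
                + "<a class=\"x\" href=\"ch1.html#s2\">S2</a> <a href=\"contact.html\">Contact</a>";
        System.out.println(hashlessHrefs(html)); // prints [/about.html, contact.html]
    }
}
```

Because # is excluded from the whole attribute value, this skips any href that contains a # anywhere, not only those that start with one.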
You should not use regex to parse HTML.
It is better to use an HTML parser such as http://jsoup.org, and then:
Document doc = Jsoup.parse(input);
Elements links = doc.select("a[href]");
for (Element each : links) {
    if (each.attr("href").startsWith("#")) continue; // skip in-page anchors
    ...
}
So much more painless than using regex, eh!

Java Jericho hyperlink parsing

I'm trying to figure out a way to get all hyperlinks in a webpage, except if they are in an anchor tag.
For this I'm using the Jericho parser.
My initial approach was to take the difference between
List<Element> elementList = source.getAllElements(); and
getAllElements(HTMLElementName.A), but other elements might also contain an anchor link within them, so I don't think that's the right approach.
I recommend jsoup for HTML processing.
Here's an example of how you can get all links (= a tags with an href attribute):
Document doc = Jsoup.connect("http:// - link here -").get(); // connect to the website and parse its HTML
Elements links = doc.select("a[href]"); // select all 'a' tags with an 'href' attribute
for (Element element : links) { // iterate over all links (example)
    // process element
}
Documentation:
Selector API (DOM API is available too)
Cookbook (Examples)
list links (Example)
JavaDoc
By the way, can you explain this a bit more?
except if they are in an anchor tag
