collect only relevant links from url - java

What I need is to collect the relevant links from a URL. For example, from a page like http://beechplane.wordpress.com/ , I need to collect the links that point to the actual articles, i.e. links like http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/ , http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/ etc.
How can I get those links in Java? Is it possible using web-crawlers?

I use the jsoup library for that.
Here is how to get all <a> tags from a document:
// doc is an already parsed org.jsoup.nodes.Document
Elements a = doc.select("a");
for (Element el : a) {
    // process element
    String href = el.attr("href");
}
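Selecting every <a> tag returns navigation, tag, and comment links as well, so the hrefs still need filtering to get "only relevant" links. Here is a minimal sketch of one way to do that, assuming (from the two example URLs in the question) that article permalinks follow a /yyyy/mm/dd/slug/ path. The class name ArticleLinkFilter and its methods are made up for illustration, and the pattern is a guess about this particular blog, not a general rule. It uses only the standard library, so the hrefs would come from the jsoup loop above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only hrefs that look like dated article permalinks.
class ArticleLinkFilter {
    // Assumption: articles live under /yyyy/mm/dd/slug/ as in the question's examples.
    private static final Pattern ARTICLE_PATH =
            Pattern.compile("^https?://[^/]+/\\d{4}/\\d{2}/\\d{2}/[^/]+/?$");

    static boolean isArticleLink(String href) {
        return href != null && ARTICLE_PATH.matcher(href).matches();
    }

    // Returns the article links in their original order, without duplicates.
    static List<String> keepArticleLinks(List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (isArticleLink(href) && !result.contains(href)) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://beechplane.wordpress.com/",
                "http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/",
                "http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/");
        for (String link : keepArticleLinks(hrefs)) {
            System.out.println(link);
        }
    }
}
```

If the site uses a different permalink scheme, only the pattern needs to change.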


Reading html list items from android java code

As explained in the title, I found this website (rpg.rem.uz) that uses a list.
I want to read the title of each list item programmatically in my Android Java code. I need it to populate a ListView in the same way that list is populated.
Please let me know whether this is possible and how to do it.
Thanks in advance.
EDIT
I tried using Jsoup, but I get a "Handshake failed" exception.
You can use the Jsoup library for parsing HTML.
Read its documentation and examples:
https://jsoup.org
Theoretically you can (android.text.Html), but in practice, don't.
A WebView (android.webkit.WebView) could satisfy your need, but you would be better off with an API for your site; JSON is what you need.
Try this; it may help. You need to add this library to your project first, then use the code below:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content"); // pass your list's id
Elements items = content.getElementsByTag("li");
for (Element item : items) {
    String itemText = item.text(); // text of one list item
}
Or you can load the HTML directly from your URL:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("li");
Just read the documentation from this link. You can also see some examples here

Extracting article links only using jsoup

I am trying to use JSoup to extract links of articles from stock symbols.
For example on this page: http://finance.yahoo.com/q/p?s=+AAPL+Press+Releases
there are a bunch of press release titles. When you press each title, you are given a link. I want to use JSoup to extract and store the links of each one of those press releases.
This is what I have so far:
Document doc = Jsoup
.connect("http://finance.yahoo.com/q/p?s=AAPL+Press+Releases").get();
And to get the links I am using:
Elements url = doc.select("p").select("a");
System.out.println(url.text());
The output that I am getting is not the link only, I am getting some other information with it. Please help me tweak the .select() statements to get only the link.
Try this code:
Document document = Jsoup.connect("http://finance.yahoo.com/q/p?s=+AAPL+Press+Releases")
        .get();
Element div = document.select("div.mod.yfi_quote_headline.withsky").first();
Elements aHref = div.select("a[href]");
for (Element element : aHref) {
    System.out.println(element.attr("abs:href"));
}
Output:
http://finance.yahoo.com/news/hagens-berman-payday-millions-e-161500428.html
http://finance.yahoo.com/news/swift-playgrounds-app-makes-learning-185500537.html
http://finance.yahoo.com/news/apple-previews-ios-10-biggest-185500113.html
http://finance.yahoo.com/news/powerful-siri-capabilities-single-sign-185500577.html
http://finance.yahoo.com/news/apple-previews-major-macos-sierra-185500097.html
http://finance.yahoo.com/news/apple-previews-watchos-3-faster-185500388.html
http://finance.yahoo.com/news/apple-union-square-highlights-design-173000006.html
http://finance.yahoo.com/news/apple-opens-development-office-hyderabad-043000495.html
http://finance.yahoo.com/news/apple-announces-ios-app-design-043000238.html
http://finance.yahoo.com/news/apple-celebrates-chinese-music-garageband-230000088.html
http://finance.yahoo.com/news/apple-sap-partner-revolutionize-iphone-183000583.html

Modifying a regex for a better matching

Considering we have this regex to match URLs in a page:
(https?):\\/\\/(www\\.)?[a-z0-9\\.:].*?(?=\\s)
My question is how we can improve it so that, for example, it matches:
http://stackoverflow.com/questions/ask
instead of:
http://stackoverflow.com/questions/ask">home</a></div><div>
In short, I want it to filter out any of ;:'".,<>?«»“”‘’ that usually come after URLs in a page's HTML code.
Since you are already using jsoup, I think the best way to get all links is to use that library. This is not about losing originality; it is a matter of code safety, readability, and maintainability.
Here is an example from the Jsoup Cookbook showing how you can collect all links from href and src attributes (judging from your regex, those are all you want to match):
List<String> links = new ArrayList<String>();
Document doc = Jsoup.connect(url).get();
Elements anchors = doc.select("a[href]");    // all <a> elements with an href
Elements media = doc.select("[src]");        // anything with a src attribute
Elements imports = doc.select("link[href]"); // <link> elements such as stylesheets
for (Element src : media) {
    links.add(src.attr("abs:src"));
}
for (Element link : imports) {
    links.add(link.attr("abs:href"));
}
for (Element anchor : anchors) {
    links.add(anchor.attr("abs:href"));
}
Note that this approach returns every link as an absolute URL, even those that were relative in the page source.
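To illustrate what resolving a relative link against the page's base URI means, here is a small standard-library sketch using java.net.URI. It is only an analogue of what the abs: prefix does, not jsoup's own implementation, and jsoup may behave differently in edge cases. The class name AbsoluteLinkDemo, the method toAbsolute, and the example.com URLs are made up for illustration.

```java
import java.net.URI;

// Hypothetical demo: resolves an href against the URL of the page it came from.
class AbsoluteLinkDemo {
    static String toAbsolute(String baseUri, String href) {
        // URI.resolve leaves absolute hrefs unchanged and resolves relative ones.
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b.html"; // assumed page URL
        System.out.println(toAbsolute(base, "img/x.png"));
        System.out.println(toAbsolute(base, "../c.html"));
        System.out.println(toAbsolute(base, "http://other.org/y"));
    }
}
```

With jsoup the base URI comes from the URL passed to Jsoup.connect, or from the third argument of Jsoup.parse when parsing a file or string.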
If you do not need that and want to show off to your boss, try the following regex:
https?://(?:www\\.)?[a-z0-9.:][^<>"]+
See the regex demo
Here, [^<>"]+ matches 1 or more symbols other than <, > and ". Thus, it restricts the regex not to cross the attribute boundary. You might want to add ' there, too.
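For reference, here is a minimal sketch of how that regex could be applied with java.util.regex. The class name UrlRegexDemo and the sample HTML are made up for illustration, and the single quote has been added to the negated class as suggested above. Keep in mind that [^<>"']+ also matches whitespace, so this is only reasonable for URLs sitting inside attribute values or tags, not for URLs embedded in running text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: extracts URLs that are delimited by quotes or angle brackets.
class UrlRegexDemo {
    // Same regex as above, with ' added to the excluded characters.
    private static final Pattern URL =
            Pattern.compile("https?://(?:www\\.)?[a-z0-9.:][^<>\"']+");

    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<div><a href=\"http://stackoverflow.com/questions/ask\">home</a></div><div>";
        for (String url : findUrls(html)) {
            System.out.println(url);
        }
    }
}
```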

Java Jericho hyperlink parsing

I'm trying to figure out a way to get all hyperlinks in a webpage, except if they are in an anchor tag (<a>).
For this I'm using the Jericho parser.
My initial approach was to take the difference between
List<Element> elementList = source.getAllElements(); and
getAllElements(HTMLElementName.A), but other elements might also contain an anchor link within them, so I don't think that's the right approach.
I recommend Jsoup for HTML processing.
Here's an example of how you can get all links (= a tags with an href attribute):
Document doc = Jsoup.connect("http:// - link here -").get(); // Connect to website and parse its html
Elements links = doc.select("a[href]"); // Select all 'a'-tags with 'href'-attribute
for (Element element : links) { // iterate over all links (example)
    // process element
}
Documentation:
Selector API (DOM API is available too)
Cookbook (Examples)
list links (Example)
JavaDoc
Btw. can you explain this a bit more?
except if they are in an anchor tag

Java: Extract all links with a certain word in them with JSoup?

Might be an unclear question so here's the code and explanation:
Document doc = Jsoup.parse(exampleHtmlData);
Elements certainLinks = doc.select("a[href=google.com/example/]");
The String exampleHtmlData contains a parsed HTML source from a certain site. This site has a lot of links which direct the user to google. A few examples would be:
http://google.com/example/hello
http://google.com/example/certaindir/anotherdir/something
http://google.com/anotherexample
I want to extract all the links that contain google.com/example/ in the link with the doc.select function. How do I do this with JSoup?
You can refer to the selector syntax. The [attr*=value] form matches elements whose attribute value contains the given text:
Document doc = Jsoup.parse(exampleHtmlData);
Elements certainLinks = doc.select("a[href*=google.com/example/]");
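The original selector a[href=google.com/example/] finds nothing because = requires the whole attribute value to equal the given text, while *= only requires it to appear somewhere in the value. Here is a plain-Java analogue of the two checks on the example URLs from the question. It is a simplified sketch that ignores any case handling jsoup applies; the class name HrefMatchDemo and its methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo: mimics the = and *= attribute tests on a list of href values.
class HrefMatchDemo {
    // Rough analogue of a[href=value]: the whole value must match.
    static List<String> exact(List<String> hrefs, String value) {
        List<String> out = new ArrayList<>();
        for (String href : hrefs) {
            if (href.equals(value)) {
                out.add(href);
            }
        }
        return out;
    }

    // Rough analogue of a[href*=value]: the value may appear anywhere.
    static List<String> containing(List<String> hrefs, String value) {
        List<String> out = new ArrayList<>();
        for (String href : hrefs) {
            if (href.contains(value)) {
                out.add(href);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://google.com/example/hello",
                "http://google.com/example/certaindir/anotherdir/something",
                "http://google.com/anotherexample");
        System.out.println(exact(hrefs, "google.com/example/"));
        System.out.println(containing(hrefs, "google.com/example/"));
    }
}
```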
