Extracting article links only using jsoup - java

I am trying to use Jsoup to extract the links to the articles listed for a stock symbol.
For example, this page: http://finance.yahoo.com/q/p?s=+AAPL+Press+Releases
lists a number of press release titles, and each title links to the full release. I want to use Jsoup to extract and store the link for each of those press releases.
This is what I have so far:
Document doc = Jsoup
.connect("http://finance.yahoo.com/q/p?s=AAPL+Press+Releases").get();
To get the links I am using:
Elements url = doc.select("p").select("a");
System.out.println(url.text());
The output is not just the links; other information is printed along with them. Please help me adjust the .select() calls so that I get only the links.

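Try this code:
Document document = Jsoup.connect("http://finance.yahoo.com/q/p?s=+AAPL+Press+Releases")
        .get();
// Narrow the search to the headline container, then take every anchor that has an href
Element div = document.select("div.mod.yfi_quote_headline.withsky").first();
Elements aHref = div.select("a[href]");
for (Element element : aHref) {
    // "abs:href" resolves the link against the document's base URI
    System.out.println(element.attr("abs:href"));
}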
Output:
http://finance.yahoo.com/news/hagens-berman-payday-millions-e-161500428.html
http://finance.yahoo.com/news/swift-playgrounds-app-makes-learning-185500537.html
http://finance.yahoo.com/news/apple-previews-ios-10-biggest-185500113.html
http://finance.yahoo.com/news/powerful-siri-capabilities-single-sign-185500577.html
http://finance.yahoo.com/news/apple-previews-major-macos-sierra-185500097.html
http://finance.yahoo.com/news/apple-previews-watchos-3-faster-185500388.html
http://finance.yahoo.com/news/apple-union-square-highlights-design-173000006.html
http://finance.yahoo.com/news/apple-opens-development-office-hyderabad-043000495.html
http://finance.yahoo.com/news/apple-announces-ios-app-design-043000238.html
http://finance.yahoo.com/news/apple-celebrates-chinese-music-garageband-230000088.html
http://finance.yahoo.com/news/apple-sap-partner-revolutionize-iphone-183000583.html
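The difference between the question's text() call and the answer's attr("abs:href") is easier to see on a small inline document. This is only a minimal sketch: the HTML string, the class name headlines, and the example.com base URI are made up for illustration and are not Yahoo's actual markup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadlineLinks {
    // Hypothetical markup standing in for a headline block; not the real Yahoo page.
    static final String HTML =
        "<div class=\"headlines\"><ul>"
        + "<li><a href=\"/news/first.html\">First release</a></li>"
        + "<li><a href=\"/news/second.html\">Second release</a></li>"
        + "</ul></div>"
        + "<p><a href=\"/help\">Help</a></p>";

    // Returns the absolute URLs of the links inside the headline container only.
    static List<String> headlineUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("div.headlines a[href]")) {
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    // Returns the combined visible link text, which is what Elements.text() yields.
    static String headlineText(String html, String baseUri) {
        return Jsoup.parse(html, baseUri).select("div.headlines a[href]").text();
    }

    public static void main(String[] args) {
        System.out.println(headlineText(HTML, "http://example.com/"));
        for (String url : headlineUrls(HTML, "http://example.com/")) {
            System.out.println(url);
        }
    }
}
```

Scoping the selector to one container keeps unrelated anchors (the Help link here) out of the result, and reading the attribute instead of the text gives the URL rather than the title.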

Related

Jsoup extract Hrefs from the HTML content

My problem is that I am trying to get the hrefs from this site with Jsoup:
https://www.amazon.de/s?k=kissen&__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&ref=nb_sb_noss_2
but it does not work.
I tried to select the links by their class like this:
Elements elements = documentMainSite.select(".a-link-normal");
and then tried to extract the hrefs with the following code:
for (Element element : elements) {
String href = element.attributes().get("href");
}
but unfortunately it gives me nothing.
Can someone tell me where my mistake is?
I don't just connect to the website. I also save each href in a string by extracting it with
String href = element.attributes().get("href");
After that I print the href string, but it is empty.
On the other hand, the same code works with a different CSS selector, so the code itself is not the problem. The CSS selector (.a-link-normal) is probably what is wrong.
You won't get anything by simply connecting to the URL with Jsoup.
Document document = Jsoup.connect(yourUrl).get();
String bodyText = document.getElementsByTag("body").get(0).text();
Here is a translation of the body text that the code above returned:
Enter the characters below We ask for your understanding and want to
be sure that you are not a bot. For best results, please use a browser
that accepts cookies. Type the characters you see in the image: Enter
characters Try another image Continue shopping Terms & Conditions
Privacy Policy © 1996-2015, Amazon.com, Inc. or its affiliates
You either need to get past the captcha or emulate a browser, for example with Selenium.
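One way to tell a wrong selector apart from a different page being served is to run the same extraction against both kinds of document and look at the title when nothing matches. The sketch below does that offline. Only the class name a-link-normal comes from the question; both HTML strings and the titles are hypothetical stand-ins, not real Amazon markup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorCheck {
    // Collects non-empty href values from anchors carrying the a-link-normal class.
    static List<String> hrefs(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a.a-link-normal[href]")) {
            String href = link.attr("href");
            if (!href.isEmpty()) {
                result.add(href);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the two possible responses.
        Document results = Jsoup.parse(
            "<html><head><title>Results</title></head><body>"
            + "<a class=\"a-link-normal\" href=\"/dp/EXAMPLE1\">Pillow</a></body></html>");
        Document robotCheck = Jsoup.parse(
            "<html><head><title>Robot Check</title></head><body>"
            + "<form><input name=\"captcha\"></form></body></html>");

        for (Document doc : new Document[] {results, robotCheck}) {
            // An empty list together with an unexpected title suggests another page was served.
            System.out.println(doc.title() + " -> " + hrefs(doc));
        }
    }
}
```

If the list is empty and the title is not the one you expect, the selector is probably fine and the response itself is the problem.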

Reading html list items from android java code

As explained in the title, I found this website (rpg.rem.uz) that presents its content as a list.
I want to read the title of each list item programmatically in my Android Java code. I need the titles to populate a ListView the same way that list is populated.
Please let me know whether this is possible and how to do it.
Thanks in advance.
EDIT
I tried using Jsoup, but I get a "Handshake failed" exception.
You can use the Jsoup library for parsing HTML.
Read the documentation and its examples:
https://jsoup.org
Theoretically you can use android.text.Html, but in practice don't.
A WebView (android.webkit.WebView) could satisfy your need, but you would be better off looking for an API for the site; JSON is what you need.
Try this; it may help. You need to add the Jsoup library to your project first, then use the code below:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content"); // pass your list id
Elements items = content.getElementsByTag("li");
for (Element item : items) {
    String itemText = item.text(); // text of one list item
}
OR
You can load the HTML directly from the URL:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("li");
The Jsoup documentation covers this in more detail and includes further examples.
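For filling a ListView, the item texts are usually wanted as a List<String> that can be handed to an adapter. The following is a minimal sketch of that step only, run against an inline HTML string. The id content and the item texts are placeholders, since the real page's markup is not shown in the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListItemTitles {
    // Returns the text of every <li> inside the element with the given id.
    static List<String> titles(String html, String listId) {
        Document doc = Jsoup.parse(html);
        Element list = doc.getElementById(listId);
        if (list == null) {
            // getElementById returns null when no element has that id.
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>();
        for (Element item : list.getElementsByTag("li")) {
            result.add(item.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder markup; replace with the HTML of the page you are reading.
        String html = "<ul id=\"content\"><li>Alpha</li><li>Beta</li></ul>";
        System.out.println(titles(html, "content"));
    }
}
```

The returned list could then be passed to something like an ArrayAdapter<String>; the null check avoids a crash when the id is not present on the page.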

collect only relevant links from url

I need to collect the relevant links from a URL. For example, from a page like http://beechplane.wordpress.com/ , I need to collect the links that point to the actual articles, i.e. links like http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/ , http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/ etc.
How can I get those links in Java? Is it possible using web crawlers?
I use the Jsoup library for that.
Here is how to get all <a> tags from a document:
Elements a = doc.select("a");
for (Element el : a) {
//process element
String href = el.attr("href");
}
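Selecting every <a> tag returns navigation and external links as well, so some filter is still needed to keep only the articles. One possible filter, sketched below, keeps links on the same site whose path contains a /yyyy/mm/dd/ segment. That date pattern is an assumption taken from the two example URLs in the question, and the inline HTML is invented for the demonstration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleLinks {
    // Assumption: article permalinks contain a /yyyy/mm/dd/ path segment.
    private static final Pattern DATED_PATH = Pattern.compile("/\\d{4}/\\d{2}/\\d{2}/");

    // Returns same-site links that look like dated articles, without duplicates.
    static Set<String> articleLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Set<String> result = new LinkedHashSet<>();
        for (Element a : doc.select("a[href]")) {
            String url = a.attr("abs:href"); // resolved against baseUri
            if (url.startsWith(baseUri) && DATED_PATH.matcher(url).find()) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented markup for illustration only.
        String html =
            "<a href=\"/2012/11/07/the-95-confidence-of-nate-silver/\">Post</a>"
            + "<a href=\"/about/\">About</a>"
            + "<a href=\"http://other.example.org/2012/01/01/elsewhere/\">External</a>"
            + "<a href=\"/2012/03/06/visualizing-probability-roulette/\">Post</a>"
            + "<a href=\"/2012/11/07/the-95-confidence-of-nate-silver/\">Same post again</a>";
        for (String url : articleLinks(html, "http://beechplane.wordpress.com/")) {
            System.out.println(url);
        }
    }
}
```

A site with a different permalink scheme would need a different pattern, or a selector scoped to the element that wraps the post titles.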

Java Jericho hyperlink parsing

I'm trying to figure out a way to get all hyperlinks in a web page, except if they are in an anchor tag (<a>).
For this I'm using the Jericho parser.
My initial approach was to take the difference between
List<Element> elementList = source.getAllElements(); and
getAllElements(HTMLElementName.A), but other elements might also contain an anchor link within them, so I don't think that's the right approach.
I recommend Jsoup for HTML processing.
Here's an example of how you can get all links (a tags with an href attribute):
Document doc = Jsoup.connect("http:// - link here -").get(); // Connect to the website and parse its HTML
Elements links = doc.select("a[href]"); // Select all a tags that have an href attribute
for (Element element : links) { // Iterate over all links (example)
    // process element
}
Documentation:
Selector API (DOM API is available too)
Cookbook (Examples)
list links (Example)
JavaDoc
By the way, can you explain this a bit more?
except if they are in an anchor tag

Java: Extract all links with a certain word in them with JSoup?

This might be an unclear question, so here's the code and an explanation:
Document doc = Jsoup.parse(exampleHtmlData);
Elements certainLinks = doc.select("a[href=google.com/example/]");
The String exampleHtmlData contains the HTML source of a certain site. This site has a lot of links that direct the user to Google. A few examples:
http://google.com/example/hello
http://google.com/example/certaindir/anotherdir/something
http://google.com/anotherexample
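I want to extract all the links that contain google.com/example/ using doc.select. How do I do this with Jsoup?
Refer to the Jsoup selector syntax: [attr*=value] matches elements whose attribute value contains the given substring, whereas [attr=value] requires an exact match.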
Document doc = Jsoup.parse(exampleHtmlData);
Elements certainLinks = doc.select("a[href*=google.com/example/]");
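To check the behaviour, here is a small sketch that runs the substring selector against the three example URLs from the question. The surrounding HTML and the class and method names are made up for the test.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ContainsSelector {
    // Returns the href of every anchor whose href contains the given fragment.
    // The fragment is concatenated into the selector unquoted, so it should not
    // contain characters such as ']' that have a meaning in selector syntax.
    static List<String> matching(String html, String fragment) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element a : doc.select("a[href*=" + fragment + "]")) {
            result.add(a.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // The three example URLs from the question, wrapped in invented markup.
        String html =
            "<a href=\"http://google.com/example/hello\">one</a>"
            + "<a href=\"http://google.com/example/certaindir/anotherdir/something\">two</a>"
            + "<a href=\"http://google.com/anotherexample\">three</a>";
        for (String href : matching(html, "google.com/example/")) {
            System.out.println(href);
        }
    }
}
```

Only the first two links are selected; the third contains google.com/ but not google.com/example/.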
