Java Jericho hyperlink parsing - java

I'm trying to figure out a way to get all hyperlinks in a webpage, except those that are inside an anchor tag (<a>).
For this I'm using the Jericho parser.
My initial approach was to take the difference between
List<Element> elementList = source.getAllElements(); and
getAllElements(HTMLElementName.A), but other elements might also contain an anchor link within them, so I don't think that's the right approach.

I recommend Jsoup for HTML processing.
Here's an example of how you can get all links (= a-tags with an href attribute):
Document doc = Jsoup.connect("http:// - link here -").get(); // Connect to website and parse its html
Elements links = doc.select("a[href]"); // Select all 'a'-tags' with 'href'-attribute
for( Element element : links ) // iterate over all links (example)
{
// process element
}
Documentation:
Selector API (DOM API is available too)
Cookbook (Examples)
list links (Example)
JavaDoc
By the way, can you explain this part a bit more?
"except if they are in an anchor tag"

Related

Reading html list items from android java code

As I explained in the title, I found this website (rpg.rem.uz) that uses a list.
I want to read the title of each list item programmatically in my Android Java code, so that I can populate a ListView the same way that list is populated.
Please let me know whether this is possible and how to do it.
Thanks in advance.
EDIT
I tried using Jsoup, but I get a handshake failed exception.
You can use the Jsoup library for parsing HTML.
Read the documentation and its examples:
https://jsoup.org
Theoretically you can (android.text.Html), but in practice you shouldn't.
A WebView (android.webkit.WebView) could satisfy your need, but you would do better to look for an API for the site; JSON is what you need.
Try this; it may help. You need to add the Jsoup library to your project first, then use the code below:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content"); // pass your list id
Elements items = content.getElementsByTag("li"); // all list items inside that element
for (Element item : items) {
String itemText = item.text(); // visible text of one list item
}
OR
You can load the HTML directly from your URL:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("li");
Just read the documentation from this link. You can also see some examples here

Extracting article links only using jsoup

I am trying to use JSoup to extract links of articles from stock symbols.
For example on this page: http://finance.yahoo.com/q/p?s=+AAPL+Press+Releases
there are a bunch of press release titles. When you press each title, you are given a link. I want to use JSoup to extract and store the links of each one of those press releases.
As of now this is what I have so far:
Document doc = Jsoup
.connect("http://finance.yahoo.com/q/p?s=AAPL+Press+Releases").get();
And to get the links I am using
Elements url = doc.select("p").select("a");
System.out.println(url.text());
The output that I am getting is not the link only, I am getting some other information with it. Please help me tweak the .select() statements to get only the link.
Try this code:
Document document = Jsoup.connect("http://finance.yahoo.com/q/p?s=+AAPL+Press+Releases")
.get();
Element div = document.select("div.mod.yfi_quote_headline.withsky").first();
Elements aHref = div.select("a[href]");
for(Element element : aHref)
System.out.println(element.attr("abs:href"));
Output:
http://finance.yahoo.com/news/hagens-berman-payday-millions-e-161500428.html
http://finance.yahoo.com/news/swift-playgrounds-app-makes-learning-185500537.html
http://finance.yahoo.com/news/apple-previews-ios-10-biggest-185500113.html
http://finance.yahoo.com/news/powerful-siri-capabilities-single-sign-185500577.html
http://finance.yahoo.com/news/apple-previews-major-macos-sierra-185500097.html
http://finance.yahoo.com/news/apple-previews-watchos-3-faster-185500388.html
http://finance.yahoo.com/news/apple-union-square-highlights-design-173000006.html
http://finance.yahoo.com/news/apple-opens-development-office-hyderabad-043000495.html
http://finance.yahoo.com/news/apple-announces-ios-app-design-043000238.html
http://finance.yahoo.com/news/apple-celebrates-chinese-music-garageband-230000088.html
http://finance.yahoo.com/news/apple-sap-partner-revolutionize-iphone-183000583.html
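A note on the abs:href key used above: in Jsoup the abs: prefix asks for the attribute value resolved against the document's base URI, which is why the output shows full URLs even when the page contains relative links. What that resolution amounts to can be sketched with the JDK's java.net.URI instead of Jsoup. The class name and the sample paths here are made up for illustration.

```java
import java.net.URI;

final class AbsoluteLinkSketch {
    // Resolves an href against the address of the page it was found on.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page address and hrefs, not taken from the real site.
        String page = "http://finance.yahoo.com/q/p?s=AAPL";
        // A site-relative href gets the scheme and host of the page.
        System.out.println(resolve(page, "/news/example-story.html"));
        // An href that is already absolute comes back unchanged.
        System.out.println(resolve(page, "http://example.com/other.html"));
    }
}
```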

collect only relevant links from url

What I need is to collect the relevant links from a URL. For example, from a page like http://beechplane.wordpress.com/ , I need to collect the links that point to the actual articles, i.e. links like http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/ , http://beechplane.wordpress.com/2012/03/06/visualizing-probability-roulette/ etc.
How can I get those links in Java? Is it possible using web crawlers?
I use the jsoup library for that.
How to get all <a> tags from a document:
Elements a = doc.select("a");
for (Element el : a) {
//process element
String href = el.attr("href");
}
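The loop above returns every link on the page, so the "relevant" ones still have to be picked out. One possible way, sketched here with plain java.util.regex, is to keep only hrefs that look like article permalinks. The /YYYY/MM/DD/slug/ shape is an assumption read off the two example links in the question, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class ArticleLinkFilter {
    // Assumption: article permalinks have the form <site>/YYYY/MM/DD/slug/.
    private static final Pattern ARTICLE = Pattern.compile(
            "^https?://beechplane\\.wordpress\\.com/\\d{4}/\\d{2}/\\d{2}/[^/]+/?$");

    // Returns only the hrefs that match the assumed permalink pattern.
    static List<String> keepArticles(List<String> hrefs) {
        List<String> articles = new ArrayList<>();
        for (String href : hrefs) {
            if (ARTICLE.matcher(href).matches()) {
                articles.add(href);
            }
        }
        return articles;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://beechplane.wordpress.com/",
                "http://beechplane.wordpress.com/2012/11/07/the-95-confidence-of-nate-silver/",
                "http://beechplane.wordpress.com/about/"); // hypothetical non-article link
        System.out.println(keepArticles(hrefs));
    }
}
```

In practice the href strings would come from el.attr("href") in the loop above; a different site would need a different pattern.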

Java: Extract all links with a certain word in them with JSoup?

This might be an unclear question, so here's the code and an explanation:
Document doc = Jsoup.parse(exampleHtmlData);
Elements certainLinks = doc.select("a[href=google.com/example/]");
The String exampleHtmlData contains the HTML source of a certain site. This site has a lot of links which direct the user to Google. A few examples would be:
http://google.com/example/hello
http://google.com/example/certaindir/anotherdir/something
http://google.com/anotherexample
I want to extract all the links that contain google.com/example/ in the link with the doc.select function. How do I do this with JSoup?
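You can refer to the selector syntax. The [attr*=value] form matches elements whose attribute value contains the given substring, whereas [attr=value] requires the whole value to be equal: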
Document doc = Jsoup.parse(exampleHtmlData);
Elements certainLinks = doc.select("a[href*=google.com/example/]");
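To see which of the three example links from the question a substring match on the href would keep, here is a minimal sketch that applies the same kind of test by hand with String.contains. It does not call Jsoup; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HrefContainsSketch {
    // Keeps the hrefs that contain the given fragment anywhere in the string.
    static List<String> containing(List<String> hrefs, String fragment) {
        List<String> matches = new ArrayList<>();
        for (String href : hrefs) {
            if (href.contains(fragment)) {
                matches.add(href);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://google.com/example/hello",
                "http://google.com/example/certaindir/anotherdir/something",
                "http://google.com/anotherexample");
        // The third link lacks the "/example/" path segment and is dropped.
        System.out.println(containing(hrefs, "google.com/example/"));
    }
}
```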

dom4j XPath not working parsing xhtml document

I'm trying to use dom4j to parse an xhtml document. If I simply print out the document I can see the entire document so I know it is being loaded correctly. The two divs that I'm trying to select are at the exact same level in the document.
html
  body
    div
      table
        tbody
          tr
            td
              table
                tbody
                  tr
                    td
                      div class="definition"
                      div class="example"
My code is
List<Element> list = document.selectNodes("//html/body/div/table/tbody/tr/td/table/tbody/tr/td");
but the list is empty when I do System.out.println(list);
If I only do List<Element> list = document.selectNodes("//html"); it does actually return a list with one element in it. So I'm confused about what's wrong with my XPath and why it won't find those divs.
Try declaring the xhtml namespace to the xpath, e.g. bind it to the prefix x and use //x:html/x:body... as XPath expression (see also this article which is however for Groovy, not for plain Java). Probably something like the following should do it in Java:
DefaultXPath xpath = new DefaultXPath("//x:html/x:body/...");
Map<String,String> namespaces = new TreeMap<String,String>();
namespaces.put("x","http://www.w3.org/1999/xhtml");
xpath.setNamespaceURIs(namespaces);
list = xpath.selectNodes(document);
(untested)
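The effect of the namespace can also be reproduced without dom4j. The following is a minimal sketch using the JDK's built-in javax.xml.xpath API, not the dom4j calls shown above; the sample markup, the class name and the prefix x are assumptions chosen for illustration. In a namespace-aware parse, an unprefixed step such as html only matches elements in no namespace, so it finds nothing in an XHTML document until a prefix is bound to the XHTML namespace.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlNamespaceSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Made-up sample document that declares the XHTML default namespace.
    static final String SAMPLE =
            "<html xmlns=\"" + XHTML_NS + "\"><body>"
            + "<div class=\"definition\">d</div><div class=\"example\">e</div>"
            + "</body></html>";

    // Counts the nodes selected by the expression; the prefix "x" is bound to XHTML.
    static int count(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep namespace information in the DOM
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "x".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "x" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                            ? Collections.singletonList("x").iterator()
                            : Collections.<String>emptyIterator();
                }
            });
            NodeList nodes = (NodeList) xpath.evaluate(expression, document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "//html/body/div"));       // unprefixed steps
        System.out.println(count(SAMPLE, "//x:html/x:body/x:div")); // prefixed steps
    }
}
```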
What about just "//div"? Or "//html/body/div/table/tbody"? I've found long literal XPath expressions hard to debug, as it's easy for my eyes to get tricked... so I break them down until it DOES work and then build back up again.
An alternative could be:
//div[@class='definition' or @class='example']
This searches for "div" elements anywhere in the document whose "class" attribute value is equal to "definition" or "example".
I find this approach more clearly illustrates what you are trying to retrieve from the page. An added benefit is if the structure of the page changes, but the div classes stay the same, then your xpath doesn't need to be updated.
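The claim that a class-based expression survives layout changes can be checked with a small sketch. It uses the JDK's javax.xml.xpath rather than dom4j, and the two sample documents and the class name are made up for illustration. It also assumes markup without a default namespace; for a real XHTML document the prefix binding from the earlier answer would still be needed.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ClassXPathSketch {
    static final String BY_CLASS = "//div[@class='definition' or @class='example']";

    // Counts the div elements selected by BY_CLASS in the given markup.
    static int countMatches(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(BY_CLASS, new InputSource(new StringReader(xml)),
                            XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same two divs, once directly under body and once nested in a table.
        String flat = "<html><body><div class=\"definition\">a</div>"
                + "<div class=\"example\">b</div><div class=\"other\">c</div></body></html>";
        String nested = "<html><body><table><tr><td><div class=\"example\">b</div>"
                + "<div class=\"definition\">a</div></td></tr></table></body></html>";
        System.out.println(countMatches(flat));
        System.out.println(countMatches(nested));
    }
}
```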
You can also check your xpath works against an HTML document using the following firefox plugin which is very useful.
Firefox Plugin - XPath Checker 0.4.4
