Get an HTML link from a webpage in Java

I want to get the link shown in the picture below using Java. There are a few more links on that webpage. I found this code on Stack Overflow, but I don't understand how to use it.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class weber{
public static void main(String[] args)throws Exception{
String url = "http://www.skyovnis.com/category/ufology/";
Document doc = Jsoup.connect(url).get();
/*String question = doc.select("#site-inner").text();
System.out.println("Question: " + question);*/
Elements anser = doc.select("#container .entry-title a");
for (Element anse : anser){
System.out.println("Answer: " + anse.text());
}
}
}
The code is edited from the original I found, though. Please help.

For your URL, the following code works fine.
public static void main(String[] args) {
Document doc;
try {
// need http protocol
doc = Jsoup.connect("http://www.skyovnis.com/category/ufology/").userAgent("Mozilla").get();
// get page title
String title = doc.title();
System.out.println("title : " + title);
// get all links (this is what you want)
Elements links = doc.select("a[href]");
for (Element link : links) {
// get the value from href attribute
System.out.println("\nlink : " + link.attr("href"));
System.out.println("text : " + link.text());
}
} catch (IOException e) {
e.printStackTrace();
}
}
The output was:
title : Ufology
link : http://www.shop.skyovnis.com/
text : Shop
link : http://www.shop.skyovnis.com/product-category/books/
text : Books
The following code filters the links by their text.
for (Element link : links) {
if(link.text().contains("Arecibo Message"))//find the link with some texts
{
System.out.println("here is the element you need");
System.out.println("\nlink : " + link.attr("href"));
System.out.println("text : " + link.text());
}
}
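The same filter can also be pushed into the selector itself. Below is a minimal sketch, assuming the link text contains the phrase you are after; the class name `LinkByText`, the method `hrefsContaining`, and the inline markup are made up for illustration and stand in for the real page. It uses jsoup's `:contains(text)` pseudo-selector and `absUrl` to resolve relative links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
final class LinkByText {

    // Returns the absolute URLs of all links whose text contains the given phrase.
    // Note: the phrase is pasted into the selector, so it must not contain ")".
    static List<String> hrefsContaining(Document doc, String phrase) {
        List<String> hrefs = new ArrayList<>();
        // :contains(...) matches case-insensitively on the element's text.
        for (Element link : doc.select("a[href]:contains(" + phrase + ")")) {
            hrefs.add(link.absUrl("href")); // resolved against the document's base URI
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Assumed markup; the real page structure may differ.
        String html = "<h2 class='entry-title'><a href='/arecibo/'>The Arecibo Message</a></h2>"
                + "<a href='http://www.shop.skyovnis.com/'>Shop</a>";
        Document doc = Jsoup.parse(html, "http://www.skyovnis.com/");
        System.out.println(hrefsContaining(doc, "Arecibo Message"));
    }
}
```

For a live page you would replace `Jsoup.parse(html, ...)` with `Jsoup.connect(url).userAgent("Mozilla").get()`.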
It is recommended to set a user agent in jsoup, because some servers reject requests without one and respond with HTTP 403.
Document doc = Jsoup.connect("http://anyurl.com").userAgent("Mozilla").get();
"Onna malli mage yuthukama kala."
Reference:
https://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/

Related

How to scrape a specific URL from multiple URLs in a webpage in Java

I am doing data scraping for the first time. My assignment is to get a specific URL from a webpage that contains multiple links (help, click here, etc.). How can I get a specific URL and ignore the other links? In this link I only want to get "The SEC adopted changes to the exempt offering framework" and ignore the other links. How do I do that in Java? I was able to extract all the URLs, but I am not sure how to get a specific one. Below is my code:
while (rs.next()) {
String Content = rs.getString("Content");
doc = Jsoup.parse(Content);
//email extract
Pattern p = Pattern.compile("[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+");
Matcher matcher = p.matcher(doc.text());
Set<String> emails = new HashSet<String>();
while (matcher.find()) {
emails.add(matcher.group());
}
System.out.println(emails);
//title extract
String title = doc.title();
System.out.println("Title: " + title);
}
Elements links = doc.select("a");
for(Element link: links) {
String url = link.attr("href");
System.out.println("\nlink :"+ url);
System.out.println("text: " + link.text());
}
System.out.println("Getting all the images");
Elements image = doc.getElementsByTag("img");
for(Element src:image) {
System.out.println("src "+ src.attr("abs:src"));
}
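One possible approach is to keep only the link whose text contains a known keyword. The sketch below assumes the wanted link can be recognised by its text; the class `SpecificLink`, the method `findLink`, and the `example.com` URLs are hypothetical placeholders for one `Content` value from the result set.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of jsoup.
final class SpecificLink {

    // Returns the href of the first link whose text contains the keyword (case-insensitive).
    static Optional<String> findLink(Document doc, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        for (Element link : doc.select("a[href]")) {
            if (link.text().toLowerCase(Locale.ROOT).contains(needle)) {
                return Optional.of(link.attr("href"));
            }
        }
        return Optional.empty(); // no link matched, e.g. only "help" or "click here" links
    }

    public static void main(String[] args) {
        // Assumed content; the URLs are placeholders.
        String content = "<p><a href='https://example.com/help'>Help</a> "
                + "<a href='https://example.com/sec-exempt-offering'>The SEC adopted changes "
                + "to the exempt offering framework</a> "
                + "<a href='https://example.com/more'>click here</a></p>";
        Document doc = Jsoup.parse(content);
        System.out.println(findLink(doc, "exempt offering framework").orElse("not found"));
    }
}
```

Inside the `while (rs.next())` loop you could call `findLink(doc, ...)` right after `Jsoup.parse(Content)`, so that every row is checked rather than only the last one.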

How can I set link inside a text in Android?

I am using Jsoup for web scraping. I can scrape the data from the web, but I am getting the links and the text separately. I want the links to be set inside my text. I am using SpannableStringBuilder, and there are a lot of links and a lot of text, so I can't figure out how to deal with the problem, as I am new to Android development.
private void getWebsite() {
new Thread(new Runnable() {
@Override
public void run() {
final SpannableStringBuilder
builder = new SpannableStringBuilder();
try {
Document doc = Jsoup.
connect("https://www.wikipedia.org/").get();
String title = doc.title();
Elements links = doc.select("a[href]");
builder.append(title).append("\n");
for (Element link : links) {
final String url = link.attr("href");
builder.append("\n")
.append("Link: ")
.append(url, new URLSpan(url),
Spannable.SPAN_EXCLUSIVE_EXCLUSIVE)
.append("\n")
.append("Text: ")
.append(link.text());
}
} catch (IOException e) {
builder.append("Error : ")
.append(e.getMessage()).append("\n");
}
runOnUiThread(new Runnable() {
@Override
public void run() {
textView.setText(builder.toString());
textView.setMovementMethod
(LinkMovementMethod.getInstance());
}
});
}
}).start();}
I am getting output in this format:
Link : //en.wikipedia.org/
Text : English 5 678 000+ articles
Link : //ja.wikipedia.org/
Text : 日本語 1 112 000+ 記事
Link : //es.wikipedia.org/
Text : Español 1 430 000+ artículos
......
......
I want output in this format:
Text: English 5 678 000+ articles
with the link //en.wikipedia.org/ attached to that line as a hyperlink, so that I can click the text and go directly to the webpage, as in MS Word.
You are looking for a way to set the text from HTML. Here is the documentation, and here is some sample code:
String str = "Do you want to search on " + "<a href=\"http://www.google.com\">" +
"Google" + "</a>" + " or " + "<a href=\"http://www.yahoo.com\">" +
"Yahoo" + "</a>" + "?";
if(Build.VERSION.SDK_INT >= 24) {
viewToSet.setText(Html.fromHtml(str, Html.FROM_HTML_MODE_LEGACY));
} else {
viewToSet.setText(Html.fromHtml(str));
}
In it, you can set values using HTML. You can also update colors, bold, italics, etc, as long as you utilize HTML properties.
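Applied to the scraped links, this could look roughly like the sketch below. It only builds the HTML string with jsoup (plain Java, no Android classes), so the Android part is left as a comment. The class `LinkHtmlBuilder`, the method `toAnchorHtml`, and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of jsoup or Android.
final class LinkHtmlBuilder {

    // Builds one <a href="...">text</a><br> entry per link and lets jsoup do the escaping.
    static String toAnchorHtml(Document page) {
        Document out = Document.createShell("");
        out.outputSettings().prettyPrint(false); // keep the generated markup on one line
        for (Element link : page.select("a[href]")) {
            String url = link.absUrl("href"); // turns "//en.wikipedia.org/" into an absolute URL
            if (url.isEmpty()) {
                continue; // skip links that cannot be resolved
            }
            out.body().appendElement("a").attr("href", url).text(link.text());
            out.body().appendElement("br");
        }
        return out.body().html();
    }

    public static void main(String[] args) {
        // Assumed fragment that mimics the language links on wikipedia.org.
        String html = "<a href='//en.wikipedia.org/'>English</a>"
                + "<a href='//de.wikipedia.org/'>Deutsch</a>";
        Document page = Jsoup.parse(html, "https://www.wikipedia.org/");
        System.out.println(toAnchorHtml(page));
        // On Android, the returned string would then be passed to Html.fromHtml(...)
        // and set on the TextView together with LinkMovementMethod.
    }
}
```

Using `absUrl` matters here because protocol-relative links such as `//en.wikipedia.org/` are unlikely to open correctly when clicked.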

JSOUP - parse a website and find information from it

I played a little with JSOUP, but I can't get the information I want with it from a website and I need some help.
For example I have this website, and I would like to extract some info like this :
-ROCK
--ACID ROCK
--PSYCHEDELIC ROCK
--BLUES ROCK
-----Aerosmith
----------One Way Street
-----AC/DC
----------Ain't No Fun (Waiting Round to Be a Millionaire)
In other words, I want a list of genres, each containing a list of artists, each containing a list of songs:
-Genre1
--Artist1
---Song1
---Song2
---Song3
--Artist2
---Song1
-Genre2
...
This is what I have so far (sorry for the messy code):
package parser;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HTMLParser {
public static void main(String[] args) {
String HTMLSTring = "<!DOCTYPE html>"
+ "<html>"
+ "<head>"
+ "<title>Music</title>"
+ "</head>"
+ "<body>"
+ "<table><tr><td><h1>Artists</h1></tr>"
+ "</table>"
+ "</body>"
+ "</html>";
Document html = Jsoup.parse(HTMLSTring);
String genre = "genres";
String artist = "artist";
String album = "album";
String song = "song";
//Document htmlFile = null;
//Element div = html.body().getElementsByAttributeValueMatching(genre, null);
// String div = html.body().getElementsByAttributeValueMatching(genre, null).text();
//String cssClass = div.className();
List genreList = new ArrayList();
String title = html.title();
String h1 = html.body().getElementsByTag("h1").text();
String h2 = html.body().getElementsByClass(genre).text();
String gen = html.body().getAllElements().text();
Document doc;
try {
doc = Jsoup.connect("http://allmusic.com/").get();
title = doc.title();
h1 = doc.text();
h2 = doc.text();
} catch (IOException e)
{
e.printStackTrace();
}
System.out.println("Title : " + title);
//System.out.println("h1 : "+ h1);
//System.out.println("h2 : "+ h2);
System.out.println("gen : all elements : " + gen);
}
}
And this is my output:
Title : AllMusic | Record Reviews, Streaming Songs, Genres & Bands
gen : all elements : Artists Artists Artists Artists Artists Artists
I haven't gotten very far.
I don't know how to extract the information (e.g. the genres, the artist names, and so on).
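The general pattern for a nested structure like this is to select the outer elements first and then run further `select` calls on each of them. The sketch below assumes a made-up markup with nested lists and the class names `genre`, `artist`, and `song`; the real site uses different markup, so the selectors would have to be adapted after inspecting the page source. `GenreTree` and `parse` are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the selectors match the assumed sample markup only.
final class GenreTree {

    // Maps genre -> artist -> songs, preserving document order.
    static Map<String, Map<String, List<String>>> parse(Document doc) {
        Map<String, Map<String, List<String>>> genres = new LinkedHashMap<>();
        for (Element genre : doc.select("li.genre")) {
            Map<String, List<String>> artists = new LinkedHashMap<>();
            for (Element artist : genre.select("li.artist")) {
                List<String> songs = new ArrayList<>();
                for (Element song : artist.select("li.song")) {
                    songs.add(song.text());
                }
                // ownText() returns only this element's text, not the nested list's.
                artists.put(artist.ownText(), songs);
            }
            genres.put(genre.ownText(), artists);
        }
        return genres;
    }

    public static void main(String[] args) {
        String html = "<ul>"
                + "<li class='genre'>Rock<ul>"
                + "<li class='artist'>Aerosmith<ul><li class='song'>One Way Street</li></ul></li>"
                + "<li class='artist'>AC/DC<ul><li class='song'>Ain't No Fun</li></ul></li>"
                + "</ul></li></ul>";
        System.out.println(parse(Jsoup.parse(html)));
    }
}
```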

Find element with given text using jsoup

I have a table like in this fiddle. I need to find data related to the row which contains given text.
For example, by providing 1707, I need to get all data in table row which contains 1707. So output should be as below.
Tuesday 2014-08-05 1707 33 43 47 52 image text
Currently I'm accessing data on html page as below.
Document doc;
try {
doc = Jsoup
.connect("url here").timeout(300000).userAgent("Mozilla").get();
Element table = doc.select("table#customers").first();
if (table != null) {
Iterator<Element> iterator = table.select("td").iterator();
while (iterator.hasNext()) {
System.out.println("Day : " + iterator.next().text());
System.out.println("Date : " + iterator.next().text());
System.out.println("Draw : " + iterator.next().text());
System.out.println("No1 : " + iterator.next().text());
System.out.println("No2 : " + iterator.next().text());
System.out.println("No3 : " + iterator.next().text());
System.out.println("No4 : " + iterator.next().text());
System.out.println("Symbol : " + iterator.next().text());
System.out.println("Non : " + iterator.next().text());
}
} else {
System.out
.println("No results were found according to search criteria.");
}
} catch (IOException e) {
e.printStackTrace();
}
}
The above code returns all the data in the table, but I only need the data in the row that contains the given text.
How could I achieve this?
As shown in the jsoup documentation, you can use the pseudo-selector :contains(text):
table.select("tr:contains(1707) td")
You can try it here
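A self-contained sketch of that selector, using a small inline table as a stand-in for the real page (the class `RowByText`, the method `cellsOfRowContaining`, and the sample rows are assumptions):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
final class RowByText {

    // Returns the cell texts of every row in table#customers that contains the given text.
    // :contains is a substring match, so "1707" would also match a row containing "17070".
    static List<String> cellsOfRowContaining(Document doc, String text) {
        List<String> cells = new ArrayList<>();
        for (Element td : doc.select("table#customers tr:contains(" + text + ") td")) {
            cells.add(td.text());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Assumed sample data in the shape of the table from the question.
        String html = "<table id='customers'>"
                + "<tr><td>Monday</td><td>2014-08-04</td><td>1706</td><td>12</td></tr>"
                + "<tr><td>Tuesday</td><td>2014-08-05</td><td>1707</td><td>33</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        System.out.println(cellsOfRowContaining(doc, "1707"));
    }
}
```

If the draw number can also appear inside other columns, it may be safer to compare the text of the specific cell in Java instead of relying on the substring match.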

JSoup Parse text and links in sequence from html file

I am trying to extract the text and links from an HTML file. At the moment I can extract both easily using JSoup, but I can only do it separately.
Here is my code:
try {
doc = (Document) Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = ((Element) doc).select("td.text");
for(Element p : paragraphs){
// System.out.println(p.text()+ "\r\n" + "***********************************************************" + "\r\n");
getGui().setTextVers(p.text()+ "\r\n" + "***********************************************************" + "\r\n");
}
Elements links = doc.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
getGui().setTextVers("\n\n"+link.text() + ">\r\n" +linkHref + "\r\n");
}
}
I have placed a .text class on the outermost td that contains text. What I would like to achieve is this: when the program finds a td with the .text class, it checks it for any links and extracts them from that section in order. So you would have:
Text
Link
Text
Link
I tried putting an inner for-each loop inside the first for-each loop, but this only printed the full list of links for the page. Can anyone help?
Try
Document doc = Jsoup.parse(new File(input), "UTF-8");
Elements paragraphs = doc.select("td.text");
for (Element p : paragraphs) {
System.out.println(p.text());
Elements links = p.getElementsByTag("a");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
System.out.println("\n\n" + linkText + ">\r\n" + linkHref + "\r\n");
}
}
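If the text and the links inside one cell need to come out interleaved in document order (text, link, text, link), one option is to walk the cell's nodes with a `NodeVisitor`. This is a sketch under that assumption, not the only way to do it; the class `TextAndLinksInOrder`, the method `inOrder`, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jsoup.
final class TextAndLinksInOrder {

    // Collects plain text runs and links of one cell in the order they appear.
    static List<String> inOrder(Element cell) {
        final List<String> parts = new ArrayList<>();
        cell.traverse(new NodeVisitor() {
            private int linkDepth = 0; // greater than 0 while inside an <a> element

            @Override
            public void head(Node node, int depth) {
                if (node instanceof Element && "a".equals(((Element) node).tagName())) {
                    Element link = (Element) node;
                    parts.add("Link: " + link.text() + " > " + link.attr("href"));
                    linkDepth++;
                } else if (node instanceof TextNode && linkDepth == 0) {
                    // Text inside a link was already reported with the link itself.
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        parts.add("Text: " + text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node instanceof Element && "a".equals(((Element) node).tagName())) {
                    linkDepth--;
                }
            }
        });
        return parts;
    }

    public static void main(String[] args) {
        // Assumed markup; example.com URLs are placeholders.
        String html = "<table><tr><td class='text'>Intro text "
                + "<a href='http://example.com/a'>First</a> more text "
                + "<a href='http://example.com/b'>Second</a></td></tr></table>";
        Document doc = Jsoup.parse(html);
        for (Element cell : doc.select("td.text")) {
            for (String part : inOrder(cell)) {
                System.out.println(part);
            }
        }
    }
}
```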
