How to get absolute URL path without files - java

I need to get the absolute URLs of the links on a page, excluding links to files. I have this code, which gets the links, but some of them are missing.
public class Main {
public static void main(String[] args) throws Exception {
URI uri = new URI("http://www.niocchi.com/");
printURLofPages(uri);
}
private static void printURLofPages(URI uri) throws IOException {
Document doc = Jsoup.connect(uri.toString()).get();
Elements links = doc.select("a[href~=^[^#]+$]");
for (Element link : links) {
String href = link.attr("abs:href");
URL url = new URL(href);
String path = url.getPath();
int lastdot = path.lastIndexOf(".");
if (lastdot > 0) {
String extension = path.substring(lastdot);
if (!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm"))
return;
}
System.out.println(href);
}
}
}
This code gives me the following links:
http://www.enormo.com/
http://www.vitalprix.com/
http://www.niocchi.com/javadoc
http://www.niocchi.com/
I need to get these links:
http://www.enormo.com/
http://www.vitalprix.com/
http://www.niocchi.com/javadoc
http://www.linkedin.com/in/flmommens
http://www.linkedin.com/in/ivanprado
http://www.linkedin.com/in/marcgracia
http://es.linkedin.com/in/tdibaja
http://www.linkody.com
http://www.niocchi.com/
Thanks a lot for any advice.

Instead of
String href = link.attr("href");
try
String href = link.attr("abs:href");
EDIT: see the docs at http://jsoup.org/cookbook/extracting-data/working-with-urls
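One likely reason links go missing in the question's loop: `return` leaves `printURLofPages` at the first link whose path has a non-HTML extension, so every later link is dropped, whereas `continue` skips only that link. Below is a minimal sketch of the filter using only `java.net.URI`, so it runs without a network connection. The class name `LinkFilter`, the method `keepPageLinks`, and the sample `.zip` link are hypothetical, and the hrefs are assumed to be absolute already (for example taken from `link.attr("abs:href")`) and syntactically valid, since `URI.create` throws on malformed input.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class LinkFilter {

    // Returns the links that do not point to a file other than .html/.htm.
    static List<String> keepPageLinks(List<String> absoluteHrefs) {
        List<String> pages = new ArrayList<>();
        for (String href : absoluteHrefs) {
            String path = URI.create(href).getPath();
            if (path == null) {
                path = ""; // opaque URIs such as mailto: have no path
            }
            // Look for an extension only in the last path segment.
            String lastSegment = path.substring(path.lastIndexOf('/') + 1);
            int lastDot = lastSegment.lastIndexOf('.');
            if (lastDot > 0) {
                String extension = lastSegment.substring(lastDot);
                if (!extension.equalsIgnoreCase(".html") && !extension.equalsIgnoreCase(".htm")) {
                    continue; // skip this link only; "return" would end the whole loop
                }
            }
            pages.add(href);
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of(
                "http://www.niocchi.com/javadoc",
                "http://www.niocchi.com/niocchi.zip", // hypothetical file link
                "http://www.linkedin.com/in/flmommens",
                "http://www.linkody.com");
        keepPageLinks(hrefs).forEach(System.out::println);
    }
}
```

Restricting the extension check to the last path segment is a design choice of this sketch: it avoids treating a dot in a directory name as the start of a file extension.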

Related

How to extract the text from the web site?

I'm working on code that parses a weather site.
I found the CSS class that holds the data I need on the page. How do I extract "on October 12" from it as a string? (Tue, Oct 12)
public class Pars {
private static Document getPage() throws IOException {
String url = "https://www.gismeteo.by/weather-mogilev-4251/3-day/";
Document page = Jsoup.parse(new URL(url), 3000);
return page;
}
public static void main(String[] args) throws IOException {
Document page = getPage();
Element Nameday = page.select("div [class=date date-2]").first();
String date = Nameday.select("div [class=date date-2").text();
System.out.println(Nameday);
}
}
The code is meant to parse the weather site. On the page I found the class that contains only the date and day of the week I need. But when I try to convert the element's data to a string, the program fails with an error.
The problem is with the class selector; it should look like this: div.date.date-2
Working code example:
public class Pars {
private static Document getPage() throws IOException {
String url = "https://www.gismeteo.by/weather-mogilev-4251/3-day/";
return Jsoup.parse(new URL(url), 3000);
}
public static void main(String[] args) throws IOException {
Document page = getPage();
Element dateDiv = page.select("div.date.date-2").first();
if(dateDiv != null) {
String date = dateDiv.text();
System.out.println(date);
}
}
}
Here is an answer to your problem: Jsoup select div having multiple classes
In future, please make sure your question is more detailed and well structured. Here is the "asking questions" guideline: https://stackoverflow.com/help/how-to-ask

Parsing website by jsoup Java

I need to read a value on https://new.ppy.sh/u/9889129 from the div with class profile-header-extra__rank-box (pp score), but it returns nothing. How can I do this?
public class mainClass
{
public static void main(String[] args) throws Exception {
String url = "https://new.ppy.sh/u/9889129";
Document document = Jsoup.connect(url).get();
String ppValue = document.select(".profile-header-extra__rank-global").text();
System.out.println("PP: "+ppValue);
}
}

Java Use URI builder?

I have the URL below:
http://www.example.com/api/Video/GetListMusicRelated/0/0/null/105358/0/0/10/null/null
This part is fixed and unchangeable:
http://www.example.com/api/Video/GetListMusicRelated/
I set parameters in this URL like below:
http://www.example.com/api/Video/GetListMusicRelated/25/60/jim/105358/20/1/5/null/null
Or:
http://www.example.com/api/Video/GetListMusicRelated/0/0/null/105358,5875,85547/0/0/10/null/null
How can I write a URL builder for this URL?
If you want to create an UrlBuilder using the builder pattern, it could be done like this:
public class UrlBuilder {
private final String root;
private int myParam1;
private String myParam2;
public UrlBuilder(final String root) {
this.root = root;
}
public UrlBuilder myParam1(int myParam1) {
this.myParam1 = myParam1;
return this;
}
public UrlBuilder myParam2(String myParam2) {
this.myParam2 = myParam2;
return this;
}
public URL build() throws MalformedURLException {
return new URL(
String.format("%s/%d/%s", root, myParam1, myParam2)
);
}
}
Then you will be able to create your URL as follows:
URL url = new UrlBuilder("http://www.example.com/api/Video/GetListMusicRelated")
.myParam1(25)
.myParam2("jim")
.build();
NB: This only shows the idea, so I used fake parameter names and an incorrect number of parameters. Note that you are supposed to have 6 parameters and should set the proper names.
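Since the real parameter names are not given in the question, one way to sketch the same idea without inventing them is a builder that appends path segments in order. Everything here is an assumption for illustration: the class name `PathUrlBuilder`, the methods `segment` and `ids`, and the premise that all values are already URL-safe (no encoding is applied).

```java
import java.util.ArrayList;
import java.util.List;

final class PathUrlBuilder {
    private final String root;
    private final List<String> segments = new ArrayList<>();

    PathUrlBuilder(String root) {
        // Drop a trailing slash so segments can be joined uniformly.
        this.root = root.endsWith("/") ? root.substring(0, root.length() - 1) : root;
    }

    // Appends one path segment; a null value becomes the literal text "null".
    PathUrlBuilder segment(Object value) {
        segments.add(String.valueOf(value));
        return this;
    }

    // Appends several ids as one comma-separated segment, e.g. 105358,5875,85547.
    PathUrlBuilder ids(long... ids) {
        List<String> parts = new ArrayList<>();
        for (long id : ids) {
            parts.add(Long.toString(id));
        }
        segments.add(String.join(",", parts));
        return this;
    }

    // Joins the root and all segments with "/".
    String build() {
        return root + "/" + String.join("/", segments);
    }

    public static void main(String[] args) {
        String url = new PathUrlBuilder("http://www.example.com/api/Video/GetListMusicRelated/")
                .segment(0).segment(0).segment(null)
                .ids(105358, 5875, 85547)
                .segment(0).segment(0).segment(10)
                .segment(null).segment(null)
                .build();
        System.out.println(url);
    }
}
```

Once the meaning of each position is known, the generic `segment` calls could be replaced by named setters as in the answer above.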
Try this:
URL domain = new URL("http://example.com");
URL url = new URL(domain + "/abcd/abcd");
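Plain concatenation like this works as long as the slashes line up. A possible alternative, shown here only as a sketch, is `java.net.URI.resolve`, which resolves a relative reference against a base. Note that the base needs a trailing slash, otherwise its last segment is replaced. The class name `ResolveExample` and the helper `resolve` are hypothetical.

```java
import java.net.URI;

final class ResolveExample {

    // Resolves a relative reference against a base URL.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Base ends with "/": the relative part is appended.
        System.out.println(resolve(
                "http://www.example.com/api/Video/GetListMusicRelated/", "25/60/jim"));
        // Base without trailing "/": the last segment is replaced.
        System.out.println(resolve(
                "http://www.example.com/api/Video/GetListMusicRelated", "25/60/jim"));
    }
}
```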

Get a part of a webpage using JSOUP

I am trying to programmatically search for a word's meaning on Google and save it to a file on my computer. I have successfully requested the page and received the response as a Document (org.jsoup.nodes.Document). Now I do not know how to get only the word's meaning from this Document. Please find the screenshot where I have indicated the part of the data that I need.
The response HTML is so big that I can't tell which element holds the data I need. Please help. Here is what I have done so far:
public class Search {
private static Pattern patternDomainName;
private Matcher matcher;
private static final String DOMAIN_NAME_PATTERN
= "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}";
static {
patternDomainName = Pattern.compile(DOMAIN_NAME_PATTERN);
}
public static void main(String[] args) {
Search obj = new Search();
Set<String> result = obj.getDataFromGoogle("debug%20meaning");
for(String temp : result){
System.out.println(temp);
}
System.out.println(result.size());
}
public String getDomainName(String url){
String domainName = "";
matcher = patternDomainName.matcher(url);
if (matcher.find()) {
domainName = matcher.group(0).toLowerCase().trim();
}
return domainName;
}
private Set<String> getDataFromGoogle(String query) {
Set<String> result = new HashSet<String>();
String request = "https://www.google.com/search?q=" + query + "&num=20";
System.out.println("Sending request..." + request);
try {
// need http protocol, set this as a Google bot agent :)
Document doc = Jsoup
.connect(request)
.userAgent(
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000).get();
/**********Here comes my data fetching logic*****************
* Don't know where to find my desired data in such a big html
*/
/*
String sc = doc.html().replaceAll("\\n", "");
System.out.println(doc.html());
*/
} catch (IOException e) {
e.printStackTrace();
}
return result;
}
}
The Google Dictionary API is deprecated!
But instead of scraping the Google search results, which is what you are doing currently, you can use the http://google-dictionary.so8848.com/ service, which is probably easier to scrape data from.

How to get the content from a website using Jsoup

I am trying to get the data from a website with this code:
@WebServlet(description = "get content from teamforge", urlPatterns = { "/JsoupEx" })
public class JsoupEx extends HttpServlet {
private static final long serialVersionUID = 1L;
private static final String URL = "http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505";
public JsoupEx() {
super();
}
protected void doGet(HttpServletRequest request,
HttpServletResponse response) throws ServletException, IOException {
Document doc = Jsoup.connect(URL).get();
for (Element table : doc.select("table.DataTbl")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 1) {
System.out.println(tds.get(0).text() + ":"
+ tds.get(2).text());
}
}
}
}
}
I am using the jsoup parser. When run, I do not get any errors, just no output.
Please help on this.
With the following code
public class Tester {
private static final String URL = "http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505";
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect(URL).get();
System.out.println(doc);
}
}
I get a java.net.SocketTimeoutException: Read timed out. I think the particular URL you are trying to crawl responds too slowly for Jsoup's default timeout. Being in Europe, my connection might be slower than yours. However, you might want to check for this exception in the log of your application server.
By setting the timeout to 10 seconds, I was able to download and parse the document:
Connection connection = Jsoup.connect(URL);
connection.timeout(10000);
Document doc = connection.get();
System.out.println(doc);
With the rest of your code I get:
Population:78,413
Population Change Since 1990:53.00%
Population Density:6,897
Male:41,137
Female:37,278
.....
Thanks Julien. I tried the following code and I am getting a SocketTimeoutException. The code is:
Connection connection = Jsoup.connect("http://www.moving.com/real-estate/city-profile/results.asp?Zip=60505");
connection.timeout(10000);
Document doc = connection.get();
System.out.println(doc);
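If a fixed 10-second timeout is still not enough, one option is to retry with a longer timeout each time. The following is only a sketch under assumptions: the class `TimeoutRetry`, the `Fetcher` interface, and the doubling policy are all hypothetical and not part of jsoup. The fetch itself is passed in as a callback, so the helper can be tried without a network.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

final class TimeoutRetry {

    // Hypothetical callback: fetch something using the given timeout in milliseconds.
    interface Fetcher<T> {
        T fetch(int timeoutMillis) throws IOException;
    }

    // Calls the fetcher up to maxAttempts times, doubling the timeout after each read timeout.
    static <T> T fetchWithGrowingTimeout(Fetcher<T> fetcher, int firstTimeoutMillis, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        int timeout = firstTimeoutMillis;
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch(timeout);
            } catch (SocketTimeoutException e) {
                last = e;     // remember the failure and try again
                timeout *= 2; // give the server more time on the next attempt
            }
        }
        throw last; // every attempt timed out
    }
}
```

With jsoup on the classpath, a call might look like `TimeoutRetry.fetchWithGrowingTimeout(t -> Jsoup.connect(URL).timeout(t).get(), 10000, 3)`. Other `IOException`s are deliberately not retried and propagate to the caller.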
