I'm trying to scrape a website for data using jsoup so I can use it in an Android Studio project. When I try to use the .text() method to get all the text of a document, it says "cannot resolve method" even though I think I imported all the right things. Is this a problem with my code, or is it something else?
My Code:
Document doc = (Document) Jsoup.connect(url).get();
text = doc.text();
Edit: Found the error: org.w3c.dom.Document was imported when the correct import is org.jsoup.nodes.Document
You need to import org.jsoup.nodes.Document.
Demo:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Main {
    public static void main(String[] args) throws IOException {
        // with org.jsoup.nodes.Document imported, no cast is needed
        Document doc = Jsoup.connect("https://www.google.com").get();
        String text = doc.text();
        System.out.println(text);
    }
}
Output:
Google We've detected you're using an older version of Chrome.Reinstall to stay secure × About Store We've detected you're using an older version of Chrome.Reinstall to stay secure × Gmail Images Sign in Remove Report inappropriate predictions × A privacy reminder from Google Remind me later Review now United KingdomPrivacyTermsSettingsSearch settingsAdvanced searchYour data in SearchHistorySearch helpSend feedbackAdvertisingBusiness How Search works
It seems that you are creating the Document incorrectly. Here's another way to do it:
URL url = new URL(.../*link here*/);
Document document = Jsoup.parse(url, 4000 /*timeout in ms*/);
document.text();
First of all, I'm a newbie; I've been coding for at most 2 months and this is my first question. I can't believe this is impossible, but I can't find the solution after googling, so I hope I can get help here. I have the following problem: I want to extract a link out of a construct of div ids and div classes which I can't easily access via the source code of the website. In the source code there is just a div id opened and closed (the id react-root), where the needed data actually is (I found it via inspect element). I've been googling for 10+ hours and just can't get a code snippet which gives me the desired link in Java. I already tried some stuff with jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class HTMLParserExample1 {
    public static void main(String[] args) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect("https://www.challengermode.com/teams/fb475ef0-d9c8-e811-bce7-000d3a214d8f/members").get();
            Element x = doc.getElementById("react-root");
            // get page title
            String title = doc.title();
            System.out.println("title : " + title);
            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get the value from href attribute
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
So if you like to use JSoup for processing your data, have a look at the chapters under Extracting Data:
DOM Navigation (low level but extremely flexible)
Selector Syntax (extremely efficient)
If you want to change to other libraries, DOM, XPath and CSS Selectors should be good keywords to lookup.
Edit: In your code I can see you already tried parts of each. Stick with the doc.select() call, but focus on the selector string you pass to it. Check CSS Selectors to find out how much they can do.
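For example, a minimal selector-syntax sketch (the HTML snippet and the selector here are illustrative, not taken from the actual page; adjust them to the markup you actually see):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    public static void main(String[] args) {
        // parse a small HTML snippet instead of a live page so the demo is self-contained
        String html = "<div id=\"react-root\"><a class=\"team-link\" href=\"/teams/123\">Team</a></div>";
        Document doc = Jsoup.parse(html);
        // CSS selector: all anchors with an href, nested inside the div with id react-root
        for (Element link : doc.select("#react-root a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}
```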
I am trying to run this code to scrape artist names for my research project, but the code doesn't work. I don't know what I am doing wrong, and since I am an absolute beginner in Java I would be pleased to get help.
I am using jsoup and I am trying to get the artist names from this website: https://www.choon.co/playlists/genre_12-bar-blues/12-bar-blues and similar websites. Every time I run the code I don't get any failure notice; it just says "Fetching...", "Done!", but nothing in between. I tried it with every class possible, but this seems not to be the problem.
I would be pleased to get some help.
package com.manu.scraper;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Spider {
    public static void main(String[] args) {
        System.out.println("Fetching...");
        try {
            Document doc = Jsoup.connect("https://www.choon.co/playlists/genre_12-bar-blues/12-bar-blues")
                    .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36")
                    .get();
            Elements temp = doc.select("class.ch-track-list__cell");
            int i = 0;
            for (Element artistList : temp) {
                i++;
                System.out.println(i + " " + artistList.getElementsByTag("a").first().text());
            }
            System.out.println("Done!");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
Thanks in advance and sorry if this is a total beginner move. :)
The reason Jsoup doesn't detect any HTML elements is that the elements on the page are rendered dynamically using JavaScript. Jsoup is an HTML parser: it will parse the page's HTML and allow you to select the elements you want. However, if you look at the source of the page you are requesting, you will see that the elements you are selecting do not exist and that it is mostly an empty HTML shell. The content you see in the browser is added to the DOM dynamically; it is not in the source HTML itself.
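If you do need the JavaScript-rendered content, one common workaround is to let a JavaScript-capable client such as HtmlUnit render the page first and then hand the resulting HTML to Jsoup. A sketch, assuming the .ch-track-list__cell class from the question actually exists in the rendered DOM (the URL and timeout are taken from the question; the selector is an assumption):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RenderedScrape {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("https://www.choon.co/playlists/genre_12-bar-blues/12-bar-blues");
            webClient.waitForBackgroundJavaScript(10_000); // give the scripts time to build the DOM
            // parse the rendered DOM with Jsoup so the usual selectors work
            Document doc = Jsoup.parse(page.asXml());
            doc.select(".ch-track-list__cell a").forEach(a -> System.out.println(a.text()));
        }
    }
}
```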
I'm currently working on a Java desktop app for a company, and they asked me to extract the 5 latest articles from a web page and display them in the app. To do this I need an HTML parser of course, and I thought directly of JSoup. But my problem is: how do I do it exactly? I found one easy example in this question: Example: How to "scan" a website (or page) for info, and bring it into my program?
with this code:
package com.stackoverflow.q2835505;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Test {
    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();
        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);
        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }
}
This code was written by BalusC and I understand it, but how do I do it when the links are not fixed, which is the case with most newspapers? For the sake of simplicity, how would I go about extracting, for example, the 5 latest articles from this news page: News?
I can't use a rss feed as my boss wants the complete articles to be displayed.
First you need to download the main page:
Document doc = Jsoup.connect("https://globalnews.ca/world/").get();
Then you select the links you are interested in, for example with CSS selectors.
Here you select all a tags whose href contains the text globalnews and that are nested in an h3 tag with class story-h. The URLs are in the href attribute of the a tag.
for(Element e: doc.select("h3.story-h > a[href*=globalnews]")) {
System.out.println(e.attr("href"));
}
Then you can process the resulting URLs as you wish, for example downloading the content of the first five of them using the same Jsoup.connect(...).get() call as for the main page.
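Putting the pieces together, a sketch of the whole flow (the five-article limit and the h3.story-h selector come from the discussion above; whether they still match the live page is an assumption):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LatestArticles {
    public static void main(String[] args) throws Exception {
        // download the main page
        Document doc = Jsoup.connect("https://globalnews.ca/world/").get();
        int count = 0;
        // pick the article links out of the listing
        for (Element link : doc.select("h3.story-h > a[href*=globalnews]")) {
            if (++count > 5) break; // only the five most recent articles
            String articleUrl = link.attr("href");
            // download each article page the same way as the main page
            Document article = Jsoup.connect(articleUrl).get();
            System.out.println(article.title());
        }
    }
}
```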
I am trying to find elements in an HTML page using Jsoup, and I need to use getElementsByAttributeValue along with all the getElement methods.
The error showing in NetBeans 8.0 is:
"cannot find symbol"
so I suppose I haven't imported the proper class at the head of the program. So what do I have to import to be able to use the getElement methods? Or, if the problem is not the import, what is going on?
I cannot use any of the getElement methods, by the way (it is not only getElementsByAttributeValue), and I am using other Jsoup methods like select with no problem.
getElementsByAttributeValue is a method of the class org.jsoup.nodes.Element, so if you have an object of this class you should be able to access the method.
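A minimal sketch: org.jsoup.nodes.Document extends Element, so you can call getElementsByAttributeValue directly on the parsed document (the data-role attribute here is just an illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class AttributeLookup {
    public static void main(String[] args) {
        Document doc = Jsoup.parse(
                "<div data-role=\"menu\">first</div><div data-role=\"content\">second</div>");
        // Document extends Element, so the getElement* methods are available on it
        Elements menus = doc.getElementsByAttributeValue("data-role", "menu");
        for (Element e : menus) {
            System.out.println(e.text()); // prints "first"
        }
    }
}
```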
OK, somehow I solved the problem with this code:
package sectors;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;

public class TomaDatos {
    public void nombre() throws IOException {
        String ubic = "D:\\Varios\\Trabajo\\Bolsa\\sectorpp.docx";
        FileInputStream file = new FileInputStream(new File(ubic));
        XWPFDocument doc = new XWPFDocument(file);
        XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    }
}
I really don't know exactly where the problem was. The only important change was adding XWPF after the new keyword here: XWPFWordExtractor ex = new XWPFWordExtractor(doc); But believe me, I tried that before and it didn't work. Maybe it is a problem with NetBeans. I am sorry that I cannot pin down the error; anyway, I hope this works for someone.
I am trying to test a website that uses JavaScript to render most of its HTML. With the HtmlUnit browser, how would you be able to access the HTML generated by the JavaScript? I was looking through their documentation but wasn't sure what the best approach might be.
WebClient webClient = new WebClient();
HtmlPage currentPage = webClient.getPage("some url");
String source = currentPage.asXml();
System.out.println(source);
This is an easy way to get back the HTML of the page, but would you use the domNode or another way to access the HTML generated by the JavaScript?
You've got to give the JavaScript some time to execute.
Check the sample working code below; the 'bucket' divs aren't in the original source.
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class GetPageSourceAfterJS {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        /* comment out to turn off annoying htmlunit warnings */
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        WebClient webClient = new WebClient();
        String url = "http://www.futurebazaar.com/categories/Home--Living-Luggage--Travel-Airbags--Duffel-bags/cid-CU00089575.aspx";
        System.out.println("Loading page now: " + url);
        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(30 * 1000); /* will wait for JavaScript to execute up to 30s */
        String pageAsXml = page.asXml();
        System.out.println("Contains bucket? --> " + pageAsXml.contains("bucket"));
        // get divs which have a 'class' attribute of 'bucket'
        List<?> buckets = page.getByXPath("//div[@class='bucket']");
        System.out.println("Found " + buckets.size() + " 'bucket' divs.");
        //System.out.println("#FULL source after JavaScript execution:\n " + pageAsXml);
    }
}
Output:
Loading page now: http://www.futurebazaar.com/categories/Mobiles-Mobile-Phones/cid-CU00089697.aspx?Rfs=brandZZFly001PYXQcurtrayZZBrand
Contains bucket? --> true
Found 3 'bucket' divs.
HtmlUnit version used:
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.12</version>
</dependency>
Assuming the issue is HTML generated by JavaScript as a result of AJAX calls, have you tried the 'AJAX does not work' section in the HtmlUnit FAQ?
There's also a section in the howtos about how to use HtmlUnit with JavaScript.
If your question isn't answered here, I think we'll need some more specifics to be able to help.