I am trying to test a website that uses JavaScript to render most of the HTML. With the HtmlUnit headless browser, how would you be able to access the HTML generated by the JavaScript? I was looking through their documentation but wasn't sure what the best approach might be.
WebClient webClient = new WebClient();
HtmlPage currentPage = webClient.getPage("some url");
String source = currentPage.asXml();
System.out.println(source);
This is an easy way to get back the HTML of the page, but would you use the DOM nodes or some other way to access the HTML generated by the JavaScript?
You need to give the JavaScript some time to execute.
Check the working sample below; the 'bucket' divs are not present in the original page source.
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class GetPageSourceAfterJS {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF); /* comment out to turn off annoying htmlunit warnings */
WebClient webClient = new WebClient();
String url = "http://www.futurebazaar.com/categories/Home--Living-Luggage--Travel-Airbags--Duffel-bags/cid-CU00089575.aspx";
System.out.println("Loading page now: "+url);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(30 * 1000); /* wait up to 30s for background JavaScript to finish */
String pageAsXml = page.asXml();
System.out.println("Contains bucket? --> "+pageAsXml.contains("bucket"));
//get divs which have a 'class' attribute of 'bucket'
List<?> buckets = page.getByXPath("//div[@class='bucket']");
System.out.println("Found "+buckets.size()+" 'bucket' divs.");
//System.out.println("#FULL source after JavaScript execution:\n "+pageAsXml);
}
}
Output:
Loading page now: http://www.futurebazaar.com/categories/Mobiles-Mobile-Phones/cid-CU00089697.aspx?Rfs=brandZZFly001PYXQcurtrayZZBrand
Contains bucket? --> true
Found 3 'bucket' divs.
HtmlUnit version used:
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.12</version>
</dependency>
Assuming the issue is HTML generated by JavaScript as a result of AJAX calls, have you tried the 'AJAX does not work' section in the HtmlUnit FAQ?
There's also a section in the howtos about how to use HtmlUnit with JavaScript.
If your question isn't answered here, I think we'll need some more specifics to be able to help.
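If I recall that FAQ section correctly, its suggestions boil down to using NicelyResynchronizingAjaxController plus one of the waitForBackgroundJavaScript* methods. A minimal sketch of that idea (the URL and the timeout are placeholders, not taken from your question):
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class AjaxPageSnapshot {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // make AJAX calls started from the main thread run synchronously
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        HtmlPage page = webClient.getPage("http://example.com/ajax-page"); // placeholder URL
        // give any remaining background JavaScript up to 10s to finish
        webClient.waitForBackgroundJavaScript(10 * 1000);
        System.out.println(page.asXml());
    }
}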
I'm trying to scrape a website for data using jsoup so I can use it in an Android Studio project. When I try to use the .text method to get all the text of a document, it says "cannot resolve method" even though I think I imported all the right things. Is this a problem with my code or is it something else?
My Code:
Document doc = (Document) Jsoup.connect(url).get();
text = doc.text();
Edit: found the error: org.w3c.dom.Document was imported when the correct import is org.jsoup.nodes.Document.
You need to import org.jsoup.nodes.Document.
Demo:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Main {
public static void main(String[] args) throws IOException {
Document doc = (Document) Jsoup.connect("https://www.google.com").get();
String text = doc.text();
System.out.println(text);
}
}
Output:
Google We've detected you're using an older version of Chrome.Reinstall to stay secure × About Store We've detected you're using an older version of Chrome.Reinstall to stay secure × Gmail Images Sign in Remove Report inappropriate predictions × A privacy reminder from Google Remind me later Review now United KingdomPrivacyTermsSettingsSearch settingsAdvanced searchYour data in SearchHistorySearch helpSend feedbackAdvertisingBusiness How Search works
It seems that you are creating the Document incorrectly. Here's how we do it:
URL url = new URL(.../*link here*/);
Document document = Jsoup.parse(url, 4000 /*timeout*/);
document.text();
I am given a URL; I need to fetch its HTML and extract the site's links from it.
I thought about using a headless browser. I'm using Java, so I would like to do the whole thing from a Java process.
An example can be the CNN site...
So far I have tried using :
testCompile 'net.sourceforge.htmlunit:htmlunit:2.32'
@Test
public void htmlUnitTest() throws Exception {
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
webClient.waitForBackgroundJavaScriptStartingBefore(20000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
final HtmlPage page = webClient.getPage(URL);
WebResponse response = page.getWebResponse();
String content = response.getContentAsString();
List<HtmlAnchor> anchors = page.getAnchors();
System.out.println("anchors.size() : " + anchors.size());
System.out.println("***********");
System.out.println(content);
System.out.println("***********");
try (BufferedWriter writer = new BufferedWriter(new FileWriter("htmlUnit.txt"))) {
writer.write(content);
}
}
}
but the response I am getting is the original HTML without being rendered (the JavaScript hasn't run and created the page anchors, in my case).
Can someone recommend another library, or tell me whether I am misusing HtmlUnit and suggest a working solution? That would be very helpful.
The waitForBackgroundJavaScriptXX methods are not configuration options; you have to call them AFTER getPage(URL) or any other interaction such as click().
One of the major differences between HtmlUnit and Selenium is the integration of all the parts. In HtmlUnit the JavaScript engine is part of the implementation, which means the API can see the current status of script execution. As a result, the waitForBackgroundJavaScriptXX methods only wait if there is some JavaScript pending; if there is none, they are no-ops.
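A minimal sketch of that ordering, assuming HtmlUnit 2.32 as in the question (the CNN URL is only a placeholder for "some JavaScript-heavy site"):
import java.util.List;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class RenderedAnchors {
    public static void main(String[] args) throws Exception {
        try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("https://edition.cnn.com/"); // placeholder URL
            // wait AFTER the page load, so any pending background JavaScript can finish
            webClient.waitForBackgroundJavaScript(20 * 1000);
            // re-read the page from the window, in case the JavaScript replaced or modified it
            page = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();
            List<HtmlAnchor> anchors = page.getAnchors();
            System.out.println("anchors after JS: " + anchors.size());
        }
    }
}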
I'm currently working on a Java desktop app for a company, and they asked me to extract the 5 latest articles from a web page and display them in the app. To do this I need an HTML parser, of course, and I immediately thought of jsoup. But my problem is: how exactly do I do it? I found one easy example in this question: Example: How to "scan" a website (or page) for info, and bring it into my program?
with this code:
package com.stackoverflow.q2835505;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Test {
public static void main(String[] args) throws Exception {
String url = "https://stackoverflow.com/questions/2835505";
Document document = Jsoup.connect(url).get();
String question = document.select("#question .post-text").text();
System.out.println("Question: " + question);
Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
System.out.println("Answerer: " + answerer.text());
}
}
}
This code was written by BalusC and I understand it, but how do I do it when the links are not fixed, which is the case with most newspapers for example? For the sake of simplicity, how would I go about extracting, say, the 5 latest articles from this news page: News?
I can't use an RSS feed, as my boss wants the complete articles to be displayed.
First you need to download the main page:
Document doc = Jsoup.connect("https://globalnews.ca/world/").get();
Then you select the links you are interested in, for example with CSS selectors.
The selector below picks every a tag whose href contains globalnews and that is nested inside an h3 tag with class story-h; the URLs sit in the href attribute of each a tag.
for(Element e: doc.select("h3.story-h > a[href*=globalnews]")) {
System.out.println(e.attr("href"));
}
Then you can process the resulting URLs however you wish, for example by downloading the content of the first five of them with the same Jsoup.connect(...).get() call as on the first line; a sketch follows below.
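A rough sketch of that last step, assuming the selector above still matches the page layout (the class names come from this answer, not from anything I have re-verified):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class LatestArticles {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://globalnews.ca/world/").get();
        Elements links = doc.select("h3.story-h > a[href*=globalnews]");
        // keep only the first five article links
        for (Element link : links.subList(0, Math.min(5, links.size()))) {
            String articleUrl = link.attr("href");
            Document article = Jsoup.connect(articleUrl).get();
            // which selector holds the article body depends on the site; title() is just a placeholder
            System.out.println(articleUrl + " -> " + article.title());
        }
    }
}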
Is it possible to crawl ajax-based web sites using Heritrix-3.2.0?
If you intend to make a "copy" of an ajax website, clearly no.
If you want to grab some data by analysing the content of the website, you can customize the crawler with an Extractor that decides which URLs to follow. On most websites you can easily guess the URLs that are interesting for your case without having to interpret the JavaScript. The AJAX callbacks would then be crawled and handed to the Processor chain; by default this stores the AJAX callback responses in the archive files.
Writing your own Extractor looks like this:
import java.io.IOException;
import org.archive.modules.extractor.ContentExtractor;
import org.archive.modules.extractor.LinkContext;
import org.archive.modules.extractor.Hop;
import org.archive.io.ReplayCharSequence;
import org.archive.modules.CrawlURI;
public class MyExtractor extends ContentExtractor {
    @Override
    protected boolean shouldExtract(CrawlURI uri) {
        return true;
    }
    @Override
    protected boolean innerExtract(CrawlURI curi) {
        try {
            ReplayCharSequence cs = curi.getRecorder().getContentReplayCharSequence();
            // ... analyse the page content cs as a CharSequence ...
            // decide you want to crawl some page with url [uri] (a String you found in cs) :
            addOutlink(curi, uri, LinkContext.NAVLINK_MISC, Hop.NAVLINK);
        } catch (IOException e) {
            // the downloaded content could not be replayed; nothing to extract for this URI
        }
        return true;
    }
}
Compile, put the jar file in the heritrix/lib directory, and insert a bean referring to MyExtractor in the fetchProcessors chain: basically, duplicate the extractorHtml line in the crawl job's cxml file.
I'm just getting started with HtmlUnit, and what I'm looking to do is take a web page and extract the raw text from it, minus all the HTML markup.
Can HtmlUnit accomplish that? If so, how? Or is there another library I should be looking at?
For example, if the page contains
<body><p>para1 test info</p><div><p>more stuff here</p></div>
I'd like it to output
para1 test info more stuff here
thanks
http://htmlunit.sourceforge.net/gettingStarted.html indicates that this is indeed possible.
@Test
public void homePage() throws Exception {
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());
final String pageAsXml = page.asXml();
assertTrue(pageAsXml.contains("<body class=\"composite\">"));
final String pageAsText = page.asText();
assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
}
NB: the page.asText() method seems to offer exactly what you are after.
Javadoc for asText (inherited by HtmlPage from DomNode)
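To see it on the exact markup from the question, here is a small sketch that serves that HTML to HtmlUnit through a MockWebConnection (the URL is just a dummy answered by the mock):
import com.gargoylesoftware.htmlunit.MockWebConnection;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class AsTextDemo {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>para1 test info</p><div><p>more stuff here</p></div></body></html>";
        WebClient webClient = new WebClient();
        MockWebConnection connection = new MockWebConnection();
        connection.setDefaultResponse(html);            // serve the snippet for any URL
        webClient.setWebConnection(connection);
        HtmlPage page = webClient.getPage("http://dummy.example/"); // dummy URL, answered by the mock
        System.out.println(page.asText());              // prints the text content without markup
    }
}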