I've made a Java server that scrapes a website, but my problem is that after a few requests (about 10 or so) I always get an ElementNotFoundException, even though the element should be there. Basically, my program checks this website for info every few minutes, but after a few runs it just throws that exception.
This is my code for scraping; I don't know what's wrong with it that makes the element not be found after a few runs.
final WebClient webClient = new WebClient();
try (final WebClient webClient1 = new WebClient()) {
final HtmlPage page = webClient.getPage("http://b7rabin.iscool.co.il/מערכתשעות/tabid/217/language/he-IL/Default.aspx");
WebResponse webResponse = page.getWebResponse();
String content = webResponse.getContentAsString();
// System.out.println(content);
HtmlSelect select = (HtmlSelect) page.getElementById("dnn_ctr914_TimeTableView_ClassesList");
HtmlOption option = select.getOptionByValue("" + userClass);
select.setSelectedAttribute(option, true);
//String jscmnd = "javascript:__doPostBack('dnn$ctr914$TimeTableView$btnChangesTable','')";
String jscmnd = "__doPostBack('dnn$ctr914$TimeTableView$btnChanges','')";
ScriptResult result = page.executeJavaScript(jscmnd);
HtmlPage page1 = (HtmlPage) result.getNewPage();
String content1 = page1.getWebResponse().getContentAsString();
//System.out.println(content1);
System.out.println("-----");
HtmlDivision getChanges = null;
String changes = "";
getChanges = page1.getHtmlElementById("dnn_ctr914_TimeTableView_PlaceHolder");
changes = getChanges.asText();
changes = changes.replaceAll("\n", "").replaceAll("\r", "");
System.out.println(changes);
}
The exception:
Exception in thread "Thread-0" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[dnn_ctr914_TimeTableView_PlaceHolder]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1552)
at scrapper$1.run(scrapper.java:108)
I am really desperate to solve this; it's the only bottleneck in my project.
You just need to wait a little before manipulating the second page, as hinted here.
So a sleep() of 3 seconds should make it always succeed.
HtmlPage page1 = (HtmlPage) result.getNewPage();
Thread.sleep(3_000); // sleep for 3 seconds
String content1 = page1.getWebResponse().getContentAsString();
Also, you don't need to instantiate two instances of WebClient.
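Putting it together, a minimal sketch of the fix (one WebClient, and the same IDs, postback call and userClass variable as in the question; waitForBackgroundJavaScript is used here instead of a raw sleep so there is no InterruptedException to handle):
try (final WebClient webClient = new WebClient()) {
    final HtmlPage page = webClient.getPage("http://b7rabin.iscool.co.il/מערכתשעות/tabid/217/language/he-IL/Default.aspx");
    HtmlSelect select = (HtmlSelect) page.getElementById("dnn_ctr914_TimeTableView_ClassesList");
    select.setSelectedAttribute(select.getOptionByValue("" + userClass), true);
    ScriptResult result = page.executeJavaScript("__doPostBack('dnn$ctr914$TimeTableView$btnChanges','')");
    HtmlPage page1 = (HtmlPage) result.getNewPage();
    // give the postback time to finish before querying the new page
    webClient.waitForBackgroundJavaScript(3_000);
    HtmlDivision changes = page1.getHtmlElementById("dnn_ctr914_TimeTableView_PlaceHolder");
    System.out.println(changes.asText().replaceAll("\n", "").replaceAll("\r", ""));
}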
Related
I'm a newbie to HtmlUnit, and I'm writing a demo script to load the source HTML of a webpage and write it to a txt file.
public static void main(String[] args) throws IOException {
try (final WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
wc.getOptions().setThrowExceptionOnScriptError(false);
final HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
WebResponse res = page.getWebResponse();
String html = res.getContentAsString();
FileWriter fw = new FileWriter(dir + "/pageHtml.txt");
fw.write(html);
fw.close();
}
}
However, it returns the HTML as if JavaScript were disabled. To try and fix this, I added this line to ensure JS is enabled on the WebClient:
wc.getOptions().setJavaScriptEnabled(true);
Despite that, nothing changes. Am I being an idiot, or is there something more subtle that needs to change?
Thanks for any help! ^_^
WebResponse res = page.getWebResponse();
String html = res.getContentAsString();
This is the response (code) you got from the server. If you would like the current DOM (the one after the JS processing is done), you can do something like
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(60_000);
System.out.println(page.asXml());
or
System.out.println(page.asNormalizedText());
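Applied to the snippet from the question, that would look roughly like this (a sketch; dir is the same directory variable as in the question):
try (final WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
    wc.getOptions().setThrowExceptionOnScriptError(false);
    final HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
    wc.waitForBackgroundJavaScript(60_000); // let the background JS finish first
    try (FileWriter fw = new FileWriter(dir + "/pageHtml.txt")) {
        fw.write(page.asXml()); // the rendered DOM, not the raw server response
    }
}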
I am accessing the website (http://www.bacnet.org/Addenda/) using the HtmlUnit API library in Java.
I am able to get the contents of the entire page, but I would like to capture only a specific area.
This is how I am fetching the page:
public static void getBACnetStandard() throws FailingHttpStatusCodeException,
MalformedURLException, IOException
{
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true);
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
String pageContent = page.asText();
System.out.println(pageContent);
}
I would like to capture the highlighted area (in RED box) from the entire page.
First, get the id of the element on the HtmlPage:
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
// inspect target_div_id and get that Element here
HtmlDivision div = page.getHtmlElementById("target_div_id");
Or, if you want to use another attribute to fetch that DOM element, here is an example using a class value (target_class_value):
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
// inspect target_class_value and get that Element here
HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@class='target_class_value']").get(0);
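Either way, once you have the element you can print just its text; a short sketch (target_div_id is a placeholder for whatever id you find when inspecting the page):
try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
    HtmlDivision div = page.getHtmlElementById("target_div_id"); // placeholder id
    System.out.println(div.asText()); // prints only the text of that section
}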
I'm trying to download a ZIP file with HTMLUnit 2.32 using the following code.
I obtain a "myfile.zip" bigger than the one downloaded through a normal browser (179kb vs 79kb) and which is corrupt.
How one should click an anchor and download a file with HTMLUnit?
WebClient wc = new WebClient(BrowserVersion.CHROME);
final String HREF_SCARICA_CONSOLIDATI = "/web/area-pubblica/quotate?viewId=export_quotate";
final String CONSOBBase = "http://www.consob.it";
HtmlPage page = wc.getPage(CONSOBBase + HREF_SCARICA_CONSOLIDATI);
final String downloadButtonXpath = "//a[contains(@href, 'javascript:downloadAzionariato()')]";
List<HtmlAnchor> downloadAnchors = page.getByXPath(downloadButtonXpath);
HtmlAnchor downloadAnchor = downloadAnchors.get(0);
UnexpectedPage downloadedFile = downloadAnchor.click();
InputStream contentAsStream = downloadedFile.getWebResponse().getContentAsStream();
File destFile = new File("/tmp", "myfile.zip");
Writer out = new OutputStreamWriter(new FileOutputStream(destFile));
IOUtils.copy(contentAsStream, out);
out.close();
I have updated your code snippet a bit to make it work. I hope the inline comments help a bit to understand what is going on (using the latest SNAPSHOT code of HtmlUnit, 2.34-SNAPSHOT 2018/11/03).
final String HREF_SCARICA_CONSOLIDATI = "http://www.consob.it/web/area-pubblica/quotate?viewId=export_quotate";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
HtmlPage page = webClient.getPage(HREF_SCARICA_CONSOLIDATI);
final String downloadButtonXpath = "//a[contains(@href, 'javascript:downloadAzionariato()')]";
List<HtmlAnchor> downloadAnchors = page.getByXPath(downloadButtonXpath);
HtmlAnchor downloadAnchor = downloadAnchors.get(0);
// click does some javascript magic - have a look at your browser
// seems like this opens a new window with the content as response
// because of this we can ignore the page returned from click
downloadAnchor.click();
// instead, we wait a bit until the javascript is done
webClient.waitForBackgroundJavaScript(1000);
// now we have to pick up the window/page that was opened as result of the download
Page downloadPage = webClient.getCurrentWindow().getEnclosedPage();
// and finally we can save the content
File destFile = new File("/tmp", "myfile.zip");
try (InputStream contentAsStream = downloadPage.getWebResponse().getContentAsStream()) {
try (OutputStream out = new FileOutputStream(destFile)) {
IOUtils.copy(contentAsStream, out);
}
}
System.out.println("Output written to " + destFile.getAbsolutePath());
}
While RBRi's considerations are interesting, I discovered that my code worked with HtmlUnit 2.32 with no modifications; I was simply writing the file the wrong way!
I used
Writer out = new OutputStreamWriter(new FileOutputStream(destFile));
IOUtils.copy(contentAsStream, out);
while it had to be (no OutputStreamWriter)
FileOutputStream out = new FileOutputStream(destFile);
IOUtils.copy(contentAsStream, out);
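Putting that correction into the original snippet, the writing part becomes (a sketch using try-with-resources so both streams are always closed):
File destFile = new File("/tmp", "myfile.zip");
// copy the raw bytes; wrapping the stream in a Writer corrupts binary content
try (InputStream contentAsStream = downloadedFile.getWebResponse().getContentAsStream();
     OutputStream out = new FileOutputStream(destFile)) {
    IOUtils.copy(contentAsStream, out);
}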
I've got an image on an html page that is also an input.
<input type="image" src=...
I couldn't care less about clicking the image; I want to save the image to a File. It seems to be impossible, which seems ridiculous. I tried casting from HtmlImageInput to HtmlImage, but I just get an error. How can I do this? Do I need to switch from HtmlUnit to something else? I don't care what I need to do to get this done.
By the way, I tried using selenium and taking a screenshot but it's taking a screenshot of the wrong area. Tried multiple different xpaths to the same element and it always takes the wrong screenshot.
Thanks for reporting.
Similar to HtmlImage, .saveAs(File) has just been added to HtmlImageInput.
BTW, if you can't use the latest snapshot, you can use:
try (WebClient webClient = new WebClient()) {
HtmlPage page = webClient.getPage("http://localhost:8080");
HtmlImageInput input = page.querySelector("input");
URL url = page.getFullyQualifiedUrl(input.getSrcAttribute());
final String accept = webClient.getBrowserVersion().getImgAcceptHeader();
final WebRequest request = new WebRequest(url, accept);
request.setAdditionalHeader("Referer", page.getUrl().toExternalForm());
WebResponse imageWebResponse = webClient.loadWebResponse(request);
}
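The snippet above stops at the WebResponse; writing its bytes to disk could then look like this (a sketch; the destination path is just an example):
try (InputStream in = imageWebResponse.getContentAsStream()) {
    // java.nio.file; overwrite any existing file at the example path
    Files.copy(in, Paths.get("/tmp", "image.png"), StandardCopyOption.REPLACE_EXISTING);
}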
HtmlImage codeImg = (HtmlImage) findElement(xpath, index);
InputStream is = null;
byte[] data = null;
try {
is = codeImg.getWebResponse(true).getContentAsStream();
data = IOUtils.toByteArray(is); // read the whole stream; available() may under-report
} catch (IOException e) {
log.error("get img stream meets error :", e);
} finally {
IOUtils.closeQuietly(is);
}
if (ArrayUtils.isEmpty(data)) {
String errorMessage = String.format("downLoad img verify code with xpath %s failed.", xpath);
throw new EnniuCrawlException(TargetResponseError.ERROR_RESPONSE_BODY, errorMessage);
}
String base64Img = Base64Utils.encodeToString(data);
I'm trying to understand how to use HtmlUnit and jsoup together and have been successful in understanding the basics. However, I'm trying to store text from a specific webpage into a string, but when I try to do this, it only returns a single line rather than the whole text.
I know the code I've written works, because when I print out p.text() it returns the whole text stored within the website.
private static String getText() {
try {
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
List<HtmlAnchor> anchors = page.getAnchors();
HtmlPage page1 = anchors.get(18).click();
String url = page1.getUrl().toString();
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select("div[class=govspeak] p");
for (Element p : paragraphs)
System.out.println(p.text());
} catch (Exception e) {
e.printStackTrace();
Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
}
return null;
}
When I introduce a string to store the text from p.text(), it only returns a single line rather than the whole text.
private static String getText() {
String text = "";
try {
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
List<HtmlAnchor> anchors = page.getAnchors();
HtmlPage page1 = anchors.get(18).click();
String url = page1.getUrl().toString();
Document doc = Jsoup.connect(url).get();
Elements paragraphs = doc.select("div[class=govspeak] p");
for (Element p : paragraphs)
text=p.text();
} catch (Exception e) {
e.printStackTrace();
Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
}
return text;
}
Ultimately, all I want to do is store the whole text into a string. Any help would be greatly appreciated, thanks in advance.
Document doc = Jsoup.connect(url).get();
String text = doc.text();
That's basically it. Since jsoup already takes care of stripping all the HTML tags, you can use doc.text() and you'll receive the content of the whole page cleaned of HTML tags.
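Applied to the method from the question, the body then reduces to something like (a sketch; the anchor index 18 is kept from the original code):
HtmlPage page1 = anchors.get(18).click();
Document doc = Jsoup.connect(page1.getUrl().toString()).get();
return doc.text(); // whole page text, tags already stripped by jsoup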
for (Element p : paragraphs)
text+=p.text(); // Append the text.
In your code, you are overwriting the value of the text variable on every iteration. That's why only the last line is returned by the function.
I think it is a strange idea to use the HtmlUnit result as the starting point for jsoup. There are various drawbacks to your approach (e.g. think about cookies). And of course HtmlUnit has already parsed the HTML; you will do the work twice.
I hope this code fulfills your requirements without jsoup.
private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
StringBuilder text = new StringBuilder();
try (WebClient webClient = new WebClient()) {
final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
List<HtmlAnchor> anchors = page.getAnchors();
HtmlPage page1 = anchors.get(18).click();
DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
for (DomNode p : paragraphs) {
text.append(p.asText());
}
}
return text.toString();
}