I'm a newbie to HtmlUnit, and I'm writing a demo script to load the source HTML of a webpage and write it to a txt file.
public static void main(String[] args) throws IOException {
    try (final WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
        wc.getOptions().setThrowExceptionOnScriptError(false);
        final HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
        WebResponse res = page.getWebResponse();
        String html = res.getContentAsString();
        FileWriter fw = new FileWriter(dir + "/pageHtml.txt");
        fw.write(html);
        fw.close();
    }
}
However, it returns the HTML as if JavaScript were disabled. To try and fix this, I added this line to ensure JS is enabled on the WebClient:
wc.getOptions().setJavaScriptEnabled(true);
Despite that, nothing changes. Am I being an idiot, or is there something more subtle that needs to change?
Thanks for any help! ^_^
WebResponse res = page.getWebResponse();
String html = res.getContentAsString();
This is the response (the raw HTML) you got from the server. If you would like the current DOM (the one after the JS processing is done), you can do something like:
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(60_000);
System.out.println(page.asXml());
or
System.out.println(page.asNormalizedText());
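Putting that together with the file writing from the question, a minimal sketch (assuming you still want the post-JavaScript markup written to a text file; the file name and wait time are just examples):
try (WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
    wc.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
    // give the background JavaScript some time to finish before reading the DOM
    wc.waitForBackgroundJavaScript(10_000);
    // asXml() serializes the current DOM, not the original server response
    try (Writer fw = new FileWriter("pageHtml.txt")) {
        fw.write(page.asXml());
    }
}
Keep in mind that heavily scripted sites may still render differently than in a real browser, so the output is not guaranteed to match exactly what you see with JavaScript enabled in Chrome.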
Related
I am accessing the website (http://www.bacnet.org/Addenda/) using htmlunit api library in Java.
I am able to get the contents of entire page but I would like to capture only specific area.
This is how I am fetching the page:
public static void getBACnetStandard() throws FailingHttpStatusCodeException,
        MalformedURLException, IOException
{
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setJavaScriptEnabled(true);
    HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
    String pageContent = page.asText();
    System.out.println(pageContent);
}
I would like to capture the highlighted area (in RED box) from the entire page.
First, get the element by its id on the HtmlPage:
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
// inspect target_div_id and get that Element here
HtmlDivision div = page.getHtmlElementById("target_div_id");
Or, if you want to use some other attribute to fetch that DOM element, here is an example using a class value (target_class_value):
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
// inspect target_class_value and get that Element here
HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@class='target_class_value']").get(0);
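Once you have the element, you can read just its content instead of the whole page. A small sketch (the id is the hypothetical one from above):
HtmlDivision div = page.getHtmlElementById("target_div_id");
// asText() returns only the visible text of this element;
// asXml() would give you its markup instead
System.out.println(div.asText());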
I'm trying to download a ZIP file with HTMLUnit 2.32 using the following code.
I obtain a "myfile.zip" that is bigger than the one downloaded through a normal browser (179 KB vs 79 KB) and which is corrupt.
How should one click an anchor and download a file with HTMLUnit?
WebClient wc = new WebClient(BrowserVersion.CHROME);
final String HREF_SCARICA_CONSOLIDATI = "/web/area-pubblica/quotate?viewId=export_quotate";
final String CONSOBBase = "http://www.consob.it";
HtmlPage page = wc.getPage(CONSOBBase + HREF_SCARICA_CONSOLIDATI);
final String downloadButtonXpath = "//a[contains(@href, 'javascript:downloadAzionariato()')]";
List<HtmlAnchor> downloadAnchors = page.getByXPath(downloadButtonXpath);
HtmlAnchor downloadAnchor = downloadAnchors.get(0);
UnexpectedPage downloadedFile = downloadAnchor.click();
InputStream contentAsStream = downloadedFile.getWebResponse().getContentAsStream();
File destFile = new File("/tmp", "myfile.zip");
Writer out = new OutputStreamWriter(new FileOutputStream(destFile));
IOUtils.copy(contentAsStream, out);
out.close();
I have updated your code snippet a bit to make it work. I hope the inline comments help to understand what is going on (this uses the latest snapshot code of HtmlUnit, 2.34-SNAPSHOT 2018/11/03).
final String HREF_SCARICA_CONSOLIDATI = "http://www.consob.it/web/area-pubblica/quotate?viewId=export_quotate";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    HtmlPage page = webClient.getPage(HREF_SCARICA_CONSOLIDATI);
    final String downloadButtonXpath = "//a[contains(@href, 'javascript:downloadAzionariato()')]";
    List<HtmlAnchor> downloadAnchors = page.getByXPath(downloadButtonXpath);
    HtmlAnchor downloadAnchor = downloadAnchors.get(0);
    // click does some javascript magic - have a look at your browser
    // seems like this opens a new window with the content as response
    // because of this we can ignore the page returned from click
    downloadAnchor.click();
    // instead, we wait a bit until the javascript is done
    webClient.waitForBackgroundJavaScript(1000);
    // now we have to pick up the window/page that was opened as result of the download
    Page downloadPage = webClient.getCurrentWindow().getEnclosedPage();
    // and finally we can save the content
    File destFile = new File("/tmp", "myfile.zip");
    try (InputStream contentAsStream = downloadPage.getWebResponse().getContentAsStream()) {
        try (OutputStream out = new FileOutputStream(destFile)) {
            IOUtils.copy(contentAsStream, out);
        }
    }
    System.out.println("Output written to " + destFile.getAbsolutePath());
}
While RBRi's considerations are interesting, I discovered my code worked with HTMLUnit 2.32 with no modifications; I was just writing the file the wrong way!
I used
Writer out = new OutputStreamWriter(new FileOutputStream(destFile));
IOUtils.copy(contentAsStream, out);
while it had to be (no OutputStreamWriter)
FileOutputStream out = new FileOutputStream(destFile);
IOUtils.copy(contentAsStream, out);
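In other words: a ZIP is binary data, so it has to go through a byte stream, never a character Writer (the Writer applies charset encoding and corrupts the bytes). A minimal sketch of the corrected save, using try-with-resources (names taken from the snippets above):
File destFile = new File("/tmp", "myfile.zip");
try (InputStream in = downloadedFile.getWebResponse().getContentAsStream();
     OutputStream out = new FileOutputStream(destFile)) {
    // copy raw bytes; no charset conversion happens here
    IOUtils.copy(in, out);
}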
I've got an image on an html page that is also an input.
<input type="image" src=...
I couldn't care less about clicking the image; I want to save the image to a File. It seems to be impossible, which seems ridiculous. I tried casting from HtmlImageInput to HtmlImage but I just get an error. How can I do this? Do I need to switch from HtmlUnit to something else? I don't care what I need to do to get this done.
By the way, I tried using selenium and taking a screenshot but it's taking a screenshot of the wrong area. Tried multiple different xpaths to the same element and it always takes the wrong screenshot.
Thanks for reporting.
Similar to HtmlImage, .saveAs(File) has just been added to HtmlImageInput.
BTW, if you can't use the latest snapshot, then you can use:
try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("http://localhost:8080");
    HtmlImageInput input = page.querySelector("input");
    URL url = page.getFullyQualifiedUrl(input.getSrcAttribute());
    final String accept = webClient.getBrowserVersion().getImgAcceptHeader();
    final WebRequest request = new WebRequest(url, accept);
    request.setAdditionalHeader("Referer", page.getUrl().toExternalForm());
    WebResponse imageWebResponse = webClient.loadWebResponse(request);
}
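The snippet above stops at loading the response; to actually end up with a file on disk you could stream it out inside the same try block, for example (the file name is just an example):
File destFile = new File("image.png");
try (InputStream in = imageWebResponse.getContentAsStream();
     OutputStream out = new FileOutputStream(destFile)) {
    byte[] buffer = new byte[8192];
    int read;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
    }
}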
HtmlImage codeImg = (HtmlImage) findElement(xpath, index);
InputStream is = null;
byte[] data = null;
try {
    is = codeImg.getWebResponse(true).getContentAsStream();
    // read the whole stream; available() only reports what can be read without blocking
    data = IOUtils.toByteArray(is);
} catch (IOException e) {
    log.error("get img stream meets error :", e);
} finally {
    IOUtils.closeQuietly(is);
}
if (ArrayUtils.isEmpty(data)) {
String errorMessage = String.format("downLoad img verify code with xpath %s failed.", xpath);
throw new EnniuCrawlException(TargetResponseError.ERROR_RESPONSE_BODY, errorMessage);
}
String base64Img = Base64Utils.encodeToString(data);
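If the goal is simply to save the image to a file (as in the question) rather than base64-encode it, the same byte array can be written out directly; a minimal sketch (the file name is hypothetical):
// java.nio.file.Files / Paths are part of the JDK since Java 7
Files.write(Paths.get("captcha.png"), data);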
I'm trying to understand how to use HtmlUnit and jSoup together and have been successful in understanding the basics. However, I'm trying to store text from a specific webpage into a string, but when I try to do this, it only returns a single line rather than the whole text.
I know the code I've written works because when I print out p.text(), it returns the whole text stored within the website.
private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}
When I introduce a string to store the text from p.text(), it only returns a single line rather than the whole text.
private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text = p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}
Ultimately, all I want to do is store the whole text into a string. Any help would be greatly appreciated, thanks in advance.
Document doc = Jsoup.connect(url).get();
String text = doc.text();
That's basically it. Since jSoup already takes care of stripping all the HTML tags, you can use doc.text() and you'll receive the content of the whole page cleaned of HTML tags.
for (Element p : paragraphs)
text+=p.text(); // Append the text.
In your code, you are overwriting the value of the variable text on every iteration. That's why only the last line is returned by the function.
I think it is a strange idea to use the HtmlUnit result as the starting point for jSoup. There are various drawbacks to your approach (e.g. think about cookies). And of course HtmlUnit has already parsed the HTML code; you would do the work twice.
I hope this code will fulfill your requirements without jSoup.
private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}
Sorry if it's kind of a big question, but I'm just looking for someone to tell me in what direction to learn more, since I have no clue. I have very basic knowledge of HTML and Java.
Someone in my family has to copy every product from a supplier into his own webshop.
The problem is he needs to put in all the articles one by one by hand, so I'm looking for a way to replace him with a program.
I already got a bit going for the price calculation; all I need now is the product info.
http://pastebin.com/WVCy55Dj
From line 1009 to around 1030.
I need 3 separate strings from the three spans with the class "CatalogusListDetailTest".
From line 987 to around 1000.
I need a way to get all these images; they are on the website at www.flamingo.be/Images/Products/Large/"productID"(our first string).jpg
Sometimes there's a _A or _B variant, as you can see in this example, so I'm looking for a way to check whether those exist and get these images as well.
If I could get this far then I'd be very thankful! I'll figure the rest out myself. Sorry for the long post, I wanted to give as much info as possible.
You can look at the HTML parser library Jsoup; documentation reference: http://jsoup.org/cookbook/
EDIT: Code to get the product code:
Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
for (Element classElement : classElements) {
    if (classElement.text().contains("Productcode :")) {
        System.out.println(classElement.parent().ownText());
    }
}
Instead of document you may have to use a narrower element to get consistent results; the above code will print all the product codes.
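For completeness, here is a sketch of how the document used above could be obtained and how the three detail spans from the question could be read. The URL is a placeholder and the class name is the one the question mentions ("CatalogusListDetailTest"), so adjust both to the real page:
// fetch and parse the product page (URL is a placeholder)
Document document = Jsoup.connect("http://www.example.com/some-product-page").get();
// the three detail spans mentioned in the question
Elements details = document.getElementsByClass("CatalogusListDetailTest");
for (Element detail : details) {
    System.out.println(detail.text());
}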
You can use JTidy for what you need.
Code Example:
public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
    URL url = new URL(pageLink);
    BufferedInputStream page = new BufferedInputStream(url.openStream());
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document response = tidy.parseDOM(page, null);
    XPathFactory factory = XPathFactory.newInstance();
    XPath xPath = factory.newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);
    String imageURL = (String) nodes.item(0).getNodeValue();
    // the image name passed here is just an example; pick whatever naming scheme you need
    saveImageNIO(imageURL, targetDir, "product");
}
where
IMAGE_PATTERN = "//a/img/@src";
but the pattern depends on how the image is nested in the page's HTML code.
Method for saving Image using NIO:
public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
    URL url = new URL(imageURL);
    try (ReadableByteChannel rbc = Channels.newChannel(url.openStream());
         FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg")) {
        // copy up to 16 MB from the channel into the file
        fos.getChannel().transferFrom(rbc, 0, 1 << 24);
    }
}