How to capture specific browser contents in Java?

I am accessing the website (http://www.bacnet.org/Addenda/) using the HtmlUnit API library in Java.
I am able to get the contents of the entire page, but I would like to capture only a specific area.
This is how I am fetching the page:
public static void getBACnetStandard() throws FailingHttpStatusCodeException,
        MalformedURLException, IOException
{
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setJavaScriptEnabled(true);
    HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
    String pageContent = page.asText();
    System.out.println(pageContent);
}
I would like to capture only the highlighted area (in the red box) from the entire page.

First, get the id of the element on the HtmlPage:
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
// inspect target_div_id and get that element here
HtmlDivision div = page.getHtmlElementById("target_div_id");
Or, if you want to use another attribute to fetch that DOM element, you can use XPath; here is an example using a class value (target_class_value):
HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
// inspect target_class_value and get that element here
// getByXPath returns a list, so take the first match and cast it
HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@class='target_class_value']").get(0);
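Putting it together, a minimal sketch of printing only one element's text instead of the whole page (the id target_div_id is a placeholder; inspect the page in your browser to find the real one):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlDivision;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class BacnetScraper {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the WebClient and its connections
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);
            HtmlPage page = webClient.getPage("http://www.bacnet.org/Addenda/");
            // "target_div_id" is a placeholder; use the id from your inspector
            HtmlDivision div = page.getHtmlElementById("target_div_id");
            // print only the text of that element instead of the whole page
            System.out.println(div.asText());
        }
    }
}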

Related

Problem with enabling JavaScript with HtmlUnit

I'm a newbie to HtmlUnit, and I'm writing a demo script to load the source HTML of a webpage and write it to a txt file.
public static void main(String[] args) throws IOException {
    try (final WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
        wc.getOptions().setThrowExceptionOnScriptError(false);
        final HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
        WebResponse res = page.getWebResponse();
        String html = res.getContentAsString();
        FileWriter fw = new FileWriter(dir + "/pageHtml.txt"); // dir: output directory path, defined elsewhere
        fw.write(html);
        fw.close();
    }
}
However, it returns the HTML as if JavaScript were disabled. To try and fix this, I added this line to ensure JS is enabled on the WebClient:
wc.getOptions().setJavaScriptEnabled(true);
Despite that, nothing changes. Am I being an idiot, or is there something more subtle that needs to change?
Thanks for any help! ^_^
WebResponse res = page.getWebResponse();
String html = res.getContentAsString();
These lines give you the raw response you got from the server. If you want the current DOM (the one after the JS processing is done), you can do something like:
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(60_000);
System.out.println(page.asXml());
or
System.out.println(page.asNormalizedText());
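Putting the pieces together, a minimal sketch of the corrected demo (assuming dir still points at your output directory, as in the question):

try (final WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
    wc.getOptions().setThrowExceptionOnScriptError(false);
    final HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
    // give background JavaScript up to 60 seconds to finish
    wc.waitForBackgroundJavaScript(60_000);
    // page.asXml() serializes the current DOM, not the original server response
    try (FileWriter fw = new FileWriter(dir + "/pageHtml.txt")) {
        fw.write(page.asXml());
    }
}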

How to use PDFBox to create a link I can click to go to another page in the same document

I am trying to use PDFBox to create a link I can click to go to another page in the same document.
From this question (How to use PDFBox to create a link that goes to *previous view*?) I see that this should be easy to do, but when I try it I get this error: Exception in thread "main" java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object
I am using this code:
//Loading an existing document consisting of 3 empty pages.
File file = new File("C:\\Users\\Student\\Documents\\MyPDF\\Test_doc.pdf");
PDDocument document = PDDocument.load(file);
PDPage page = document.getPage(1);
PDAnnotationLink link = new PDAnnotationLink();
PDPageDestination destination = new PDPageFitWidthDestination();
PDActionGoTo action = new PDActionGoTo();
destination.setPageNumber(2);
action.setDestination(destination);
link.setAction(action);
link.setPage(page);
I am using PDFBox 2.0.13. Can anyone give me some guidance on what I'm doing wrong?
Appreciate all answers.
First of all, for a local link ("a link I can click to go to another page in the same document"), destination.setPageNumber is the wrong method to use, cf. its JavaDocs:
/**
 * Set the page number for a remote destination. For an internal destination, call
 * {@link #setPage(PDPage) setPage(PDPage page)}.
 *
 * @param pageNumber The page for a remote destination.
 */
public void setPageNumber( int pageNumber )
Thus, replace
destination.setPageNumber(2);
by
destination.setPage(document.getPage(2));
Furthermore, you forgot to set a rectangle area for the link and you forgot to add the link to the page annotations.
All together:
PDPage page = document.getPage(1);
PDAnnotationLink link = new PDAnnotationLink();
PDPageDestination destination = new PDPageFitWidthDestination();
PDActionGoTo action = new PDActionGoTo();
destination.setPage(document.getPage(2));
action.setDestination(destination);
link.setAction(action);
link.setPage(page);
link.setRectangle(page.getMediaBox());
page.getAnnotations().add(link);
(AddLink test testAddLinkToMwb_I_201711)
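One detail the snippet leaves implicit: the change only reaches the disk once the document is saved. A minimal follow-up (the output path here is just an example, not from the question):

// persist the modified document; the output path is an example
document.save(new File("C:\\Users\\Student\\Documents\\MyPDF\\Test_doc_linked.pdf"));
document.close();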

How can I get the first URL result on Google Video Search? (tag selector)

I want to get the first URL result of a Google video search programmatically using jsoup. I have a problem with the Google video encoding or the HTML tags (probably the selector: .g>.r>a).
public static String getYoutubeURLByName(String search) throws IOException {
    String google = "https://www.google.com/videohp?hl=";
    String charset = "UTF-8";
    String userAgent = "Mozilla/5.0";
    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
    String url = links.get(0).absUrl("href");
    url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
    return url;
}
I get the following error. I think the select() call returns an empty list because something is wrong with the encoded URL or the HTML selector (probably the selector; my HTML is poor :( )
at java.util.ArrayList.rangeCheck(ArrayList.java:657)
at java.util.ArrayList.get(ArrayList.java:433)
at getYoutubeURLByName(.java:100)
This is not a duplicate of: How can you search Google Programmatically Java API
Important edit:
The string google should be:
https://www.google.com/search?tbm=vid&hl=en-TR&source=hp&biw=&bih=&q=
It seems that your selector is not correct, because the elements are not direct children. Try using:
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g .r a");
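A minimal sketch of the corrected method with a guard against an empty result list (an empty list is what produces the IndexOutOfBoundsException above); note that Google may block automated queries or change its markup at any time:

public static String getYoutubeURLByName(String search) throws IOException {
    String google = "https://www.google.com/search?tbm=vid&hl=en-TR&source=hp&biw=&bih=&q=";
    String charset = "UTF-8";
    String userAgent = "Mozilla/5.0";
    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset))
            .userAgent(userAgent)
            .get()
            .select(".g .r a");
    // guard: get(0) on an empty list throws IndexOutOfBoundsException
    if (links.isEmpty()) {
        return null;
    }
    String url = links.get(0).absUrl("href");
    // Google wraps the target in a redirect URL; extract the parameter value
    url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), charset);
    return url;
}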

Storing text into a String using jSoup

I'm trying to understand how to use HtmlUnit and jsoup together, and I have been successful in understanding the basics. However, when I try to store text from a specific webpage into a string, it only returns a single line rather than the whole text.
I know the code I've written works, because when I print out p.text(), it returns the whole text stored within the website.
private static String getText() {
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            System.out.println(p.text());
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return null;
}
When I introduce a string to store the text from p.text(), it only returns a single line rather than the whole text.
private static String getText() {
    String text = "";
    try {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        String url = page1.getUrl().toString();
        Document doc = Jsoup.connect(url).get();
        Elements paragraphs = doc.select("div[class=govspeak] p");
        for (Element p : paragraphs)
            text = p.text();
    } catch (Exception e) {
        e.printStackTrace();
        Logger.getLogger(HTMLParser.class.getName()).log(Level.SEVERE, null, e);
    }
    return text;
}
Ultimately, all I want to do is store the whole text into a string. Any help would be greatly appreciated, thanks in advance.
Document doc = Jsoup.connect(url).get();
String text = doc.text();
That's basically it. Since jsoup already takes care of stripping all the HTML tags from the text, you can use doc.text() and you'll receive the content of the whole page without HTML tags.
for (Element p : paragraphs)
text+=p.text(); // Append the text.
In your code, you are overwriting the value of the variable text on every iteration. That's why only the last paragraph is returned by the function.
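With many paragraphs, a StringBuilder plus an explicit separator is the more idiomatic fix (a sketch; the newline separator is an assumption):

StringBuilder text = new StringBuilder();
for (Element p : paragraphs) {
    // append each paragraph followed by a newline so paragraphs stay separated
    text.append(p.text()).append('\n');
}
return text.toString();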
I think it is a strange idea to use the HtmlUnit result as the starting point for jsoup. There are various drawbacks to your approach (e.g., think about cookies). And of course HtmlUnit has parsed the HTML already; you will do the work twice.
I hope this code will fulfill your requirements without jsoup.
private static String getText() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    StringBuilder text = new StringBuilder();
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.gov.uk/government/policies/brexit");
        List<HtmlAnchor> anchors = page.getAnchors();
        HtmlPage page1 = anchors.get(18).click();
        DomNodeList<DomNode> paragraphs = page1.querySelectorAll("div[class=govspeak] p");
        for (DomNode p : paragraphs) {
            text.append(p.asText());
        }
    }
    return text.toString();
}

ElementNotFoundException in HtmlUnit, although the element exists

I've made a Java server that scrapes a website, but my problem is that after a few requests (about 10 or so) I always get an ElementNotFoundException, although the element should be there. Basically, my program just checks this website for info every few minutes, but after a few times it just gives me that exception.
This is my code for scraping. I don't know what's wrong with it that makes the element not be found after a few runs:
final WebClient webClient = new WebClient();
try (final WebClient webClient1 = new WebClient()) {
    final HtmlPage page = webClient.getPage("http://b7rabin.iscool.co.il/מערכתשעות/tabid/217/language/he-IL/Default.aspx");
    WebResponse webResponse = page.getWebResponse();
    String content = webResponse.getContentAsString();
    // System.out.println(content);
    HtmlSelect select = (HtmlSelect) page.getElementById("dnn_ctr914_TimeTableView_ClassesList");
    HtmlOption option = select.getOptionByValue("" + userClass);
    select.setSelectedAttribute(option, true);
    //String jscmnd = "javascript:__doPostBack('dnn$ctr914$TimeTableView$btnChangesTable','')";
    String jscmnd = "__doPostBack('dnn$ctr914$TimeTableView$btnChanges','')";
    ScriptResult result = page.executeJavaScript(jscmnd);
    HtmlPage page1 = (HtmlPage) result.getNewPage();
    String content1 = page1.getWebResponse().getContentAsString();
    //System.out.println(content1);
    System.out.println("-----");
    HtmlDivision getChanges = null;
    String changes = "";
    getChanges = page1.getHtmlElementById("dnn_ctr914_TimeTableView_PlaceHolder");
    changes = getChanges.asText();
    changes = changes.replaceAll("\n", "").replaceAll("\r", "");
    System.out.println(changes);
}
The exception:
Exception in thread "Thread-0" com.gargoylesoftware.htmlunit.ElementNotFoundException: elementName=[*] attributeName=[id] attributeValue=[dnn_ctr914_TimeTableView_PlaceHolder]
at com.gargoylesoftware.htmlunit.html.HtmlPage.getHtmlElementById(HtmlPage.java:1552)
at scrapper$1.run(scrapper.java:108)
I am really desperate to solve this; it's the only bottleneck in my project.
You just need to wait a little before manipulating the second page, as hinted here.
So, a sleep() of 3 seconds would make it succeed every time.
HtmlPage page1 = (HtmlPage) result.getNewPage();
Thread.sleep(3_000); // sleep for 3 seconds
String content1 = page1.getWebResponse().getContentAsString();
Also, you don't need to instantiate two instances of WebClient.
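A fixed Thread.sleep() is brittle if the server is ever slower than 3 seconds; HtmlUnit's waitForBackgroundJavaScript() is a less fragile alternative (a sketch, using the single WebClient the answer recommends):

HtmlPage page1 = (HtmlPage) result.getNewPage();
// wait up to 3 seconds for background JavaScript jobs instead of sleeping blindly
webClient.waitForBackgroundJavaScript(3_000);
String content1 = page1.getWebResponse().getContentAsString();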
