Cannot download full Document using HtmlUnit and Jsoup combination (using Java)

Problem Statement:
I want to crawl this page : http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0
Let's say I want to parse the address, which is "24, Middle Gap Road, The Peak, Hong Kong".
What I did:
I first tried loading the page with Jsoup alone, but I noticed that the page takes some time to load, so I also plugged in HtmlUnit to wait for the page to finish loading first.
Code I wrote:
public static void parseByHtmlUnit() throws Exception {
    String url = "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0";
    WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
    webClient.waitForBackgroundJavaScriptStartingBefore(30000);
    HtmlPage page = webClient.getPage(url);
    synchronized (page) {
        page.wait(30000);
    }
    try {
        Document doc = Jsoup.parse(page.asXml());
        String address = ElementsUtil.getTextOrEmpty(doc.select(".addr"));
        System.out.println("address" + address);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Expected output:
In the console, I should get this output:
address 24, Middle Gap Road, The Peak, Hong Kong
Actual output:
address

How about this?
final Document document = Jsoup.parse(
new URL("http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0"),
30000
);
System.out.println(document.select(".addr").text());
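If the address is only filled in by background JavaScript, plain Jsoup will never see it, and the HtmlUnit route from the question is still needed. Below is a minimal sketch of that variant, assuming the ".addr" selector from the question is correct and the usual HtmlUnit/Jsoup imports; the timeout values are arbitrary:

String url = "http://www.hongkonghomes.com/en/property/rent/the_peak/middle_gap_road/10305?page_no=1&rec_per_page=12&order=rental+desc&offset=0";
try (WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38)) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(false);
    HtmlPage page = webClient.getPage(url);
    // Wait for background JavaScript *after* the page has been fetched;
    // calling waitForBackgroundJavaScriptStartingBefore() before getPage() has nothing to wait for.
    webClient.waitForBackgroundJavaScript(30000);
    Document doc = Jsoup.parse(page.asXml());
    System.out.println("address " + doc.select(".addr").text());
}

The synchronized/wait block from the question should not be necessary; waitForBackgroundJavaScript() already blocks until the pending JavaScript jobs have finished or the timeout expires.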

Related

Determine the nesting of pages on the site

I need to determine each page's nesting level, that is, its depth in clicks from the home page. What is the right way to do this? I understand that all the pages of the site will have to be crawled recursively.
The code will look like this:
// Set of URLs crawled so far (field not shown in the original snippet).
private Set<String> links = new HashSet<>();

public void getPageLinks(String URL) {
    // 4. Check if you have already crawled the URL
    //    (we are intentionally not checking for duplicate content in this example)
    if (!links.contains(URL)) {
        try {
            // 4. (i) If not, add it to the index
            if (links.add(URL)) {
                System.out.println(URL);
            }
            // 2. Fetch the HTML code
            Document document = Jsoup.connect(URL).get();
            // 3. Parse the HTML to extract links to other URLs
            Elements linksOnPage = document.select("a[href]");
            // 5. For each extracted URL... go back to Step 4.
            for (Element page : linksOnPage) {
                getPageLinks(page.attr("abs:href"));
            }
        } catch (IOException e) {
            System.err.println("For '" + URL + "': " + e.getMessage());
        }
    }
}
There would also have to be a check of whether a link points to an external site; if so, it should not be followed.
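To answer the nesting-level part of the question: a breadth-first traversal makes the depth bookkeeping straightforward, because every link found on a page at depth n is at depth n+1. A sketch, assuming Jsoup on the classpath and skipping external hosts as discussed above (the class and method names are only illustrative):

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DepthCrawler {

    // Maps each visited URL to its depth in clicks from the home page.
    private final Map<String, Integer> depths = new HashMap<>();

    public Map<String, Integer> crawl(String homeUrl, int maxDepth) throws Exception {
        String homeHost = new URI(homeUrl).getHost();
        Deque<String> queue = new ArrayDeque<>();
        depths.put(homeUrl, 0);
        queue.add(homeUrl);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            int depth = depths.get(url);
            if (depth >= maxDepth) {
                continue; // do not expand pages that are already at the depth limit
            }
            try {
                Document document = Jsoup.connect(url).get();
                for (Element link : document.select("a[href]")) {
                    String next = link.attr("abs:href");
                    if (next.isEmpty() || depths.containsKey(next)) {
                        continue; // not resolvable, or already seen
                    }
                    if (!homeHost.equalsIgnoreCase(new URI(next).getHost())) {
                        continue; // external site, do not follow
                    }
                    depths.put(next, depth + 1); // one more click than the page it was found on
                    queue.add(next);
                }
            } catch (IOException | URISyntaxException e) {
                System.err.println("For '" + url + "': " + e.getMessage());
            }
        }
        return depths;
    }
}

After the crawl, depths.get(url) gives the number of clicks from the home page for every internal page reached within maxDepth.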

fill a form in an ASP dynamic page with HtmlUnit

I'm making a little script in Java to check iPhone IMEI numbers.
There is this site from Apple:
https://appleonlinefra.mpxltd.co.uk/search.aspx
You have to enter an IMEI number. If the number is OK, it takes you to this page:
https://appleonlinefra.mpxltd.co.uk/Inspection.aspx
Otherwise, you stay on the /search.aspx page.
I want to open the search page, enter an IMEI, submit, and check whether the URL has changed. My code uses a working IMEI number.
Here is my java code :
HtmlPage page = webClient.getPage("https://appleonlinefra.mpxltd.co.uk/search.aspx");
HtmlTextInput imei_input = (HtmlTextInput)page.getElementById("ctl00_ContentPlaceHolder1_txtIMEIVal");
imei_input.setValueAttribute("012534008614194");
//HtmlAnchor check_imei = page.getAnchorByText("Rechercher");
//Tried with both ways of getting the anchor, none works
HtmlAnchor anchor1 = (HtmlAnchor)page.getElementById("ctl00_ContentPlaceHolder1_imeiValidate");
page = anchor1.click();
System.out.println(page.getUrl());
I can't figure out where the problem comes from, since I often use HtmlUnit for this and have never had this issue. Maybe it is because of the short loading time after submitting?
Thank you in advance.
You can do this by using a connection wrapper that HtmlUnit provides.
Here is an example:
new WebConnectionWrapper(webClient) {
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("Inspection.aspx")) {
            String content = response.getContentAsString("UTF-8");
            WebResponseData data = new WebResponseData(content.getBytes("UTF-8"), response.getStatusCode(),
                    response.getStatusMessage(), response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
With the connection wrapper above, you can inspect any request and response that passes through HtmlUnit.
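For the original goal (checking whether the click ends up on Inspection.aspx), a sketch along these lines should be enough; it reuses the element IDs and IMEI value from the question and simply gives the postback some time to finish before reading the current URL (the 5-second timeout is an arbitrary choice):

try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setThrowExceptionOnScriptError(false);

    HtmlPage page = webClient.getPage("https://appleonlinefra.mpxltd.co.uk/search.aspx");
    HtmlTextInput imeiInput = (HtmlTextInput) page.getElementById("ctl00_ContentPlaceHolder1_txtIMEIVal");
    imeiInput.setValueAttribute("012534008614194");

    HtmlAnchor validate = (HtmlAnchor) page.getElementById("ctl00_ContentPlaceHolder1_imeiValidate");
    validate.click();

    // Let the postback/redirect finish, then look at whatever page the window now contains.
    webClient.waitForBackgroundJavaScript(5000);
    HtmlPage current = (HtmlPage) webClient.getCurrentWindow().getEnclosedPage();

    boolean accepted = current.getUrl().toExternalForm().contains("Inspection.aspx");
    System.out.println(accepted ? "Valid IMEI (Inspection.aspx)" : "Still on search.aspx");
}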

jsoup- Search Form's URL not changing in post()

I'm developing a web crawler.
I need to insert a value into the input field of a form (for a search) and get the result programmatically. The form uses the POST method and its action value is "/SetReviewFilter#REVIEWS".
The problem is that when I do the search on the website manually, the URL of the website doesn't change. I think the web page posts to itself.
Here is the link to the webpage.
I have no idea how to implement this, but I tried the following:
private Document getReviewSearchDocument(Document search, String search_url) {
    // search_url is the URL of the search document fetched previously
    // search is the current document of the webpage
    Element input = search.getElementsByClass("ratings_and_types").first();
    Element link = input.select("div:nth-child(1) > form ").first();
    Document rdocument = null;
    if (link != null) {
        System.out.println("form found! on: " + search_url);
    } else {
        System.out.println("Form not found");
    }
    Connection connection = Jsoup.connect(search_url + "/SetReviewFilter#REVIEWS").timeout(30 * 1000)
            .ignoreContentType(true).ignoreHttpErrors(true);
    try {
        Connection.Response resp = connection.execute();
        if (resp.statusCode() == 200) {
            rdocument = connection.data("q", this.keywords)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36")
                    .execute().parse();
            System.out.println("Success: " + resp.statusCode());
            System.out.println("document: " + rdocument.text());
        } else {
            System.out.println("no search match");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return rdocument;
}
If anybody has an idea on this, please share it.
Thank you.
I tried a few alternatives and modified my code to issue a Jsoup POST request to get the job done, but I failed several times because of problems with cookies. I found that this single POST request needs almost 50 cookies (thanks to the Chrome console), and some of them I couldn't set myself because they are tied to other websites (e.g. Facebook). The worst part is that I have to make this request for every hotel in every city, which can be up to 85,000 requests, so it would be a costly process. (-5 for me for not seeing that coming)
Therefore I rebuilt the project as web automation using Selenium in Java, and searching within forms became easy. Thank you!
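For reference, the usual Jsoup pattern for the attempted approach (a POST that forwards the session cookies from a previous response) looks roughly like the snippet below, assuming the usual org.jsoup imports; the URLs and form field names are placeholders, not the real markup, and as noted above it was still not enough in this particular case:

// First request: load the page so the server sets its session cookies.
Connection.Response landing = Jsoup.connect("https://example.com/hotel-page")   // placeholder URL
        .method(Connection.Method.GET)
        .execute();
Map<String, String> cookies = landing.cookies();

// Second request: POST the search form, forwarding the cookies from the first response.
Document result = Jsoup.connect("https://example.com/SetReviewFilter#REVIEWS")  // placeholder URL
        .cookies(cookies)
        .data("q", "pool")                                                       // placeholder field and value
        .userAgent("Mozilla/5.0")
        .post();
System.out.println(result.text());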

Getting All YouTube Video Ids/Urls From Video Manager

I'm using Selenium to log into my Google account and to visit YouTube.
Now, on the video manager, I would like to get all of my video ids. I tried copying the CSS selector or XPath that the developer tools in Chrome give me, but each of them contains the video id, which makes them impossible to use like this:
List<WebElement> allVideoUrls = driver.findElements(By.cssSelector("my-selector-which-gives-all-videos-on-page"));
Note that I have to be logged in to be able to "see" unlisted or private videos as well, so that's required.
So far I have a bad implementation which sometimes fails to work for some reason. I first get all the links on the page and only keep the ones that are for editing a video. To avoid a StaleElementReferenceException, I retrieve all links again inside the loop.
public void getVideoInformation() {
    // Visit video manager
    driver.get("https://www.youtube.com/my_videos?o=U");
    // Wait until video list has loaded
    new WebDriverWait(driver, 10).until(ExpectedConditions
            .visibilityOfElementLocated(By.cssSelector("#vm-playlist-video-list-ol")));
    // Return all links on page
    List<WebElement> allLinks = driver.findElements(By.tagName("a"));
    HashSet<String> videoLinks = new HashSet<>();
    for (int linksIndex = 0; linksIndex < allLinks.size(); linksIndex++) {
        String link = driver.findElements(By.tagName("a")).get(linksIndex)
                .getAttribute("href");
        try {
            if (link.contains("edit")) {
                System.out.println(link);
                // No duplicates
                videoLinks.add(link);
            }
        } catch (Exception error) {
            error.printStackTrace();
        }
    }
    // ...
}
I'm fine with the fact that I need to load every other page as well to get all the videos, but please help me find an efficient/reliable way of getting the video ids.
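One way to make the loop above more reliable is to ask Selenium only for the edit links in the first place and read each href exactly once, which avoids most StaleElementReferenceExceptions; the video id can then be cut out of the href. A sketch (the selector and the video_id parameter name are assumptions about the video manager's markup, not guaranteed):

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class VideoIdScraper {

    // Matches the value of a "video_id" query parameter in an edit link.
    private static final Pattern VIDEO_ID = Pattern.compile("video_id=([\\w-]+)");

    public static Set<String> collectVideoIds(WebDriver driver) {
        Set<String> ids = new LinkedHashSet<>();
        // Select only anchors whose href contains "edit" instead of filtering all links afterwards.
        for (WebElement link : driver.findElements(By.cssSelector("a[href*='edit']"))) {
            String href = link.getAttribute("href");
            if (href == null) {
                continue;
            }
            Matcher m = VIDEO_ID.matcher(href);
            if (m.find()) {
                ids.add(m.group(1)); // the id only; the Set takes care of duplicates
            }
        }
        return ids;
    }
}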

Writing a simple web crawler that interacts with the browser (Java)

I need to create an automated process (preferably using Java) that will:
Open browser with specific url.
Login, using the username and password specified.
Follow one of the links on the page.
Refresh the browser.
Log out.
This is basically done to gather some statistics for analysis. Every time a user follows the link, a bunch of data is generated for that particular user and saved in the database. What I need to do is ping the page every 5-15 minutes with around 10 fake users.
Can you think of a simple way of doing that? There has to be an alternative to the endless login-refresh-logout manual process...
Try Selenium.
It's not Java, but JavaScript. You could do something like:
window.location = "<url>"
document.getElementById("username").value = "<email>";
document.getElementById("password").value = "<password>";
document.getElementById("login_box_button").click();
...
etc
With this kind of structure you can easily cover 1-3. Throw in some for loops for page refreshes and you're done.
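If you would rather stay in Java, the same five steps translate almost one-to-one to Selenium WebDriver. A sketch, with placeholder URLs, element ids, and credentials standing in for whatever the real login page uses:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class StatsPinger {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            // 1. Open the browser with a specific URL.
            driver.get("https://example.com/login");                   // placeholder URL
            // 2. Log in with the given username and password.
            driver.findElement(By.id("username")).sendKeys("user1");   // placeholder ids and credentials
            driver.findElement(By.id("password")).sendKeys("secret");
            driver.findElement(By.id("login_box_button")).click();
            // 3. Follow one of the links on the page.
            driver.findElement(By.linkText("Reports")).click();        // placeholder link text
            // 4. Refresh the browser.
            driver.navigate().refresh();
            // 5. Log out.
            driver.findElement(By.id("logout")).click();               // placeholder id
        } finally {
            driver.quit();
        }
    }
}

Wrapping the body in a loop over the ten fake users and scheduling it every 5-15 minutes (for example with a ScheduledExecutorService) covers the rest of the requirements.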
Use HtmlUnit if you want fast, simple, Java-based web interaction/crawling.
For example, here is some simple code that prints a bunch of output and shows how to access all IMG elements of the loaded page:
public class HtmlUnitTest {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://www.google.com");
        System.out.println(page.getTitleText());
        for (HtmlElement node : page.getHtmlElementDescendants()) {
            if (node.getTagName().toUpperCase().equals("IMG")) {
                System.out.println("NAME: " + node.getTagName());
                System.out.println("WIDTH: " + node.getAttribute("width"));
                System.out.println("HEIGHT: " + node.getAttribute("height"));
                System.out.println("TEXT: " + node.asText());
                System.out.println("XML: " + node.asXml());
            }
        }
    }
}
Example #2: Accessing named input fields and entering data/clicking:
final HtmlPage page = webClient.getPage("http://www.google.com");
HtmlElement inputField = page.getElementByName("q");
inputField.type("Example input");
HtmlElement btnG = page.getElementByName("btnG");
Page secondPage = btnG.click();
if (secondPage instanceof HtmlPage) {
    System.out.println(page.getTitleText());
    System.out.println(((HtmlPage) secondPage).getTitleText());
}
NB: You can use page.refresh() on any Page object.
You could use Jakarta JMeter.
