I am trying to get all the main links, then click on them and navigate to the resulting page:
WebClient client = new WebClient();
HtmlPage page = client.getPage(url);
// Get all links with a href of www.example.com/pages/1_
List<HtmlAnchor> links = (List<HtmlAnchor>) page.getByXPath("//a[href='www.example.com/pages/1_*'");
links[0].click();
After calling click, does it return an HtmlPage? (The NetBeans documentation is not telling me.)
Does the XPath expression look OK?
I don't know how documentation works in NetBeans, but the documentation is all available online. There you'll see that the return type is <P extends Page>, which will probably be HtmlPage most of the time but could also be XmlPage or something similar.
Simulates clicking on this element,
returning the page in the window that
has the focus after the element has
been clicked. Note that the returned
page may or may not be the same as the
original page, depending on the type
of element being clicked, the presence
of JavaScript action listeners, etc.
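As for the XPath: the expression in the question is missing its closing bracket, refers to the attribute as href instead of @href, and uses * as a wildcard, which XPath 1.0 does not support. A prefix match is normally written with starts-with(). Note also that a java.util.List is indexed with get(0), not [0]. The sketch below uses only the JDK's javax.xml.xpath on a small made-up snippet to show the expression; the class name and the sample markup are assumptions for illustration, and the same expression string should be usable with page.getByXPath(...).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LinkXPathSketch {
    // Prefix match on the href attribute; note the @ and the closing bracket.
    static final String LINKS_XPATH =
            "//a[starts-with(@href, 'www.example.com/pages/1_')]";

    // Returns the href values of all matching anchors in a well-formed snippet.
    static List<String> matchingHrefs(String xhtml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                LINKS_XPATH,
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODESET);
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(nodes.item(i).getAttributes().getNamedItem("href").getNodeValue());
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup, only for demonstrating the expression.
        String xhtml = "<html><body>"
                + "<a href='www.example.com/pages/1_a'>A</a>"
                + "<a href='www.example.com/other'>B</a>"
                + "<a href='www.example.com/pages/1_b'>C</a>"
                + "</body></html>";
        List<String> hrefs = matchingHrefs(xhtml);
        // Lists are indexed with get(0), not [0].
        System.out.println(hrefs.get(0));
    }
}
```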
I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate these types of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not a job for Jsoup! Jsoup is a parser with a nice DOM API that lets you deal with wild HTML as if it were well-formed rather than riddled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
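private void scrape(String url) throws IOException {
Document doc = Jsoup.connect(url).get();
// Analyze current document content here...
// Then continue with the "next" link, if present.
// "#" selects by id; "abs:href" resolves relative links against the page URL.
for (Element link : doc.select("#ctl00_MainContent_btnNext")) {
scrape(link.attr("abs:href"));
}
}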
But in the general case what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example, Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare it with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get. Again, make your code behave exactly like your browser.
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There will be more data parameters to get from each response, but a lot depends on the site you are targeting.
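As a rough sketch of breaking down such a response, the sample above suggests that each record has the shape length|type|id|content| where length is the number of characters in content. Assuming that layout holds for your target site (it is an assumption, so check it against real responses), reading the length lets you extract values that may themselves contain pipes, which a plain substringBetween would cut short. The class and method names here are made up for illustration, and only the JDK is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeltaResponseSketch {
    // Walks "length|type|id|content|" records and returns id -> content
    // for every record whose type is "hiddenField".
    // Assumes the length prefix counts the characters of the content part;
    // a non-numeric length prefix will throw NumberFormatException.
    static Map<String, String> hiddenFields(String response) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        while (pos < response.length()) {
            int lenEnd = response.indexOf('|', pos);
            if (lenEnd < 0) break;
            int typeEnd = response.indexOf('|', lenEnd + 1);
            if (typeEnd < 0) break;
            int idEnd = response.indexOf('|', typeEnd + 1);
            if (idEnd < 0) break;

            int length = Integer.parseInt(response.substring(pos, lenEnd));
            String type = response.substring(lenEnd + 1, typeEnd);
            String id = response.substring(typeEnd + 1, idEnd);

            int contentStart = idEnd + 1;
            int contentEnd = contentStart + length;
            if (contentEnd > response.length()) break; // truncated record
            String content = response.substring(contentStart, contentEnd);

            if ("hiddenField".equals(type)) {
                fields.put(id, content);
            }
            pos = contentEnd + 1; // skip the '|' that terminates the record
        }
        return fields;
    }

    public static void main(String[] args) {
        // Tiny made-up response in the same shape as the sample above.
        String response = "4|updatePanel|p1|a|b||"
                + "0|hiddenField|__EVENTARGUMENT||"
                + "8|hiddenField|__VIEWSTATEGENERATOR|32423F7A|";
        System.out.println(hiddenFields(response));
    }
}
```

The values returned for __VIEWSTATE and the other hidden fields are what you would feed into the next request's data parameters.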
Hope this helps a little.
I'm trying to get the textbox with u_0_1e as id, from the page wall but HtmlUnit does not find anything. The last line prints null.
Here's the code:
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
JavaScriptEngine engine = new JavaScriptEngine(client);
client.setJavaScriptEngine(engine);
HtmlPage home = client.getPage("https://www.facebook.com/login.php");
HtmlSubmitInput login = (HtmlSubmitInput) home.getElementById("u_0_1");
HtmlTextInput name = (HtmlTextInput) home.getElementById("email");
HtmlPasswordInput pass = (HtmlPasswordInput) home.getElementById("pass");
name.setValueAttribute("myname");
pass.setValueAttribute("mypass");
HtmlPage page = login.click();
HtmlPage wall = client.getPage("https://www.facebook.com/");
System.out.println(wall.getElementById("u_0_1e"));
I have some comments about your issue.
First of all, you have disabled HtmlUnit's logging. So if you have any JavaScript issue then you are not going to see it. If you are actually getting a JavaScript error then the JavaScript code won't be fully executed. If the element you're trying to fetch was dynamically fetched from the server (probably using AJAX) then the JavaScript errors, if any, might result in that element not being fetched.
If you are web scraping, which is clearly the case, then you don't have any control over the JS, so you can only accept that it does not work or disable JS and process the AJAX requests manually.
Of course, you will see the page perfectly working on a real browser but take into consideration that the JavaScript engine HtmlUnit uses is different from the real browsers.
Secondly, the two lines containing the word engine are absolutely unneeded.
Thirdly, as I mentioned in a previous question of yours, this will be more suitable to be handled by means of the Facebook API.
Finally, you might find this other answer useful:
JavaScript not being properly executed in HtmlUnit
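To see those JavaScript problems instead of hiding them, one option is to raise the log level rather than switching it off. This is a minimal sketch using the same java.util.logging logger name as the question; it assumes HtmlUnit's log output ends up in java.util.logging in your setup, as the question's code implies. The class and method names are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class HtmlUnitLoggingSketch {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured level can be lost if the logger is garbage collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware");

    // Lets warnings and errors through instead of Level.OFF.
    static Logger showWarnings() {
        HTMLUNIT_LOGGER.setLevel(Level.WARNING);
        return HTMLUNIT_LOGGER;
    }

    public static void main(String[] args) {
        System.out.println(showWarnings().getLevel());
    }
}
```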
I'm trying to fetch data from this webpage: http://www.atm-mi.it/en/Giromilano/Pages/default.aspx. Basically I'm using HtmlUnit in Java to interact with the "Route and timetable finder" in the middle of the left column, looping through each option in the select, clicking on "Find" and gathering the data I need from the resulting pages.
I've had no problem extracting data for urban routes, but can't seem to handle the radio buttons above: clicking on "Underground" in a browser, for example, should bring a new page with different options in the select below.
But I keep getting the same Select as before; to be more precise, I keep getting the same page (page2 has the same HTML code as page).
Clearly something must be going wrong in the .click() function, but what?
This is a simple version of my code:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
webClient.setThrowExceptionOnScriptError(false);
HtmlPage page = webClient.getPage("http://www.atm-mi.it/en/Giromilano/Pages/default.aspx");
HtmlRadioButtonInput radioButton2 = (HtmlRadioButtonInput) page.getElementById("ctl00_SPWebPartManager1_g_e31ad29e_62a8_401c_43ae_eb61300b4fc0_lines_type_rbl_0");
HtmlPage page2 = radioButton2.click();
HtmlSelect lineSelect = (HtmlSelect) page2.getElementById("ctl00_SPWebPartManager1_g_e31ad29e_62a8_401c_43ae_eb61300b4fc0_txt_dp_lines");
int size = lineSelect.getOptionSize();
System.out.println(size);
This is the radio button input HTML:
<input id="ctl00_SPWebPartManager1_g_e31ad29e_62a8_401c_43ae_eb61300b4fc0_lines_type_rbl_0" type="radio" name="ctl00$SPWebPartManager1$g_e31ad29e_62a8_401c_43ae_eb61300b4fc0$lines_type_rbl" value="0" onclick="javascript:setTimeout('__doPostBack(\'ctl00$SPWebPartManager1$g_e31ad29e_62a8_401c_43ae_eb61300b4fc0$lines_type_rbl$0\',\'\')', 0)" />
<label for="ctl00_SPWebPartManager1_g_e31ad29e_62a8_401c_43ae_eb61300b4fc0_lines_type_rbl_0">Underground</label>
The select:
<select name="ctl00$SPWebPartManager1$g_e31ad29e_62a8_401c_43ae_eb61300b4fc0$txt_dp_lines" id="ctl00_SPWebPartManager1_g_e31ad29e_62a8_401c_43ae_eb61300b4fc0_txt_dp_lines" class="dplinee">
EDIT:
Ok, so I've tried a different approach: since it looked like some kind of JavaScript engine problem, I figured I could try and disable JavaScript, carrying out the onclick action myself. This is the original JavaScript function:
var theForm = document.forms['aspnetForm'];
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
And this is what I did:
HtmlForm aspnetForm = (HtmlForm) page.getElementById("aspnetForm");
HtmlHiddenInput eventTarget = (HtmlHiddenInput) page.getElementById("__EVENTTARGET");
HtmlHiddenInput eventArgument = (HtmlHiddenInput) page.getElementById("__EVENTARGUMENT");
eventTarget.setValueAttribute("ctl00$SPWebPartManager1$g_e31ad29e_62a8_401c_43ae_eb61300b4fc0$lines_type_rbl$0");
eventArgument.setValueAttribute("");
HtmlElement submitButton = (HtmlElement) page.createElement("button");
submitButton.setAttribute("type", "submit");
aspnetForm.appendChild(submitButton);
HtmlPage page2 = submitButton.click();
All good, except I still keep getting the same page with the same old Select.
I know this is quite a long and boring question, but I thought I could update it anyway. I hope somebody will eventually have the patience to try this out (and at least confirm I'm not making some obvious mistake).
I finally found a way to make this work. The second approach was almost right. I was correctly submitting the form, but with a difference from normal browsing: I didn't actually check the radio button. Apparently, the destination page used that information too. By adding this
HtmlRadioButtonInput radioButton = (HtmlRadioButtonInput) page.getElementById("ctl00_SPWebPartManager1_g_e31ad29e_62a8_401c_43ae_eb61300b4fc0_lines_type_rbl_0");
radioButton.setChecked(true);
to my previous attempt the submit action worked perfectly. I still don't know why the .click() method didn't work as expected, but this is good enough for me.
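A likely explanation for why the checked state mattered: in a normal form submission, a radio button only contributes its name=value pair when it is checked, so a manual submit with the button left unchecked would omit the lines_type_rbl parameter entirely. The sketch below is a simplified, JDK-only model of that rule, not HtmlUnit code; the Field class and the shortened field names are made up for illustration, and URL encoding is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FormDataSketch {
    // Hypothetical, simplified stand-in for a form control.
    static final class Field {
        final String type;
        final String name;
        final String value;
        final boolean checked;

        Field(String type, String name, String value, boolean checked) {
            this.type = type;
            this.name = name;
            this.value = value;
            this.checked = checked;
        }
    }

    // Collects name=value pairs roughly the way a browser does: radio buttons
    // and checkboxes are skipped unless they are checked.
    static List<String> submittedPairs(List<Field> fields) {
        List<String> pairs = new ArrayList<>();
        for (Field f : fields) {
            boolean checkable = f.type.equals("radio") || f.type.equals("checkbox");
            if (checkable && !f.checked) {
                continue;
            }
            pairs.add(f.name + "=" + f.value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        Field target = new Field("hidden", "__EVENTTARGET", "lines_type_rbl$0", false);
        // Unchecked radio: only __EVENTTARGET is submitted.
        System.out.println(submittedPairs(Arrays.asList(
                target, new Field("radio", "lines_type_rbl", "0", false))));
        // Checked radio: lines_type_rbl=0 is submitted as well.
        System.out.println(submittedPairs(Arrays.asList(
                target, new Field("radio", "lines_type_rbl", "0", true))));
    }
}
```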
Here is my situation: the user selects a section (for example from a dropdown) such as "Section1," "Section2" or "Section3." Then he clicks the OK button (or some link).
What I need to happen: after he clicks on that button/link, he will be redirected to the selected section, e.g. www.homepage.com/docs#section2.
So far, I have not been able to process the form from Link's onClick method, nor have I been able to call some clickLink on Link from the Button method onSubmit().
I would prefer not to use AJAX or JavaScript. How can I do this?
That's because a Link doesn't submit the form. It just acts as a link to somewhere. To access your form data you'll need to submit the form first. Try using a SubmitLink instead of a Link and call
getRequestCycle().setRequestTarget
(new RedirectRequestTarget("www.homepage.com/docs#section2"));
from the onSubmit function of the SubmitLink.
Judging from the Javadoc this should work but I can't test it right now.
A RequestTarget that will send a redirect url to the browser. Use this if you
want to direct the browser to some external URL, like Google etc, immediately.
Or if you want to redirect to a Wicket page. If you want to redirect with a
delay the RedirectPage will do a meta tag redirect with a delay.
Did you try Link.setAnchor(Component)?
I want to fill in a text field of an HTTP form from Java and then click the submit button from Java, so as to get the page source of the document returned after submitting the form.
I can do this by sending an HTTP request directly, but I don't want to do it that way.
I usually do it using HtmlUnit. They have an example on their page:
@Test
public void submittingForm() throws Exception {
final WebClient webClient = new WebClient();
// Get the first page
final HtmlPage page1 = webClient.getPage("http://some_url");
// Get the form that we are dealing with and within that form,
// find the submit button and the field that we want to change.
final HtmlForm form = page1.getFormByName("myform");
final HtmlSubmitInput button = form.getInputByName("submitbutton");
final HtmlTextInput textField = form.getInputByName("userid");
// Change the value of the text field
textField.setValueAttribute("root");
// Now submit the form by clicking the button and get back the second page.
final HtmlPage page2 = button.click();
}
And you can read more here.
If you don't want to talk HTTP directly (why?), then take a look at Watij.
It allows you to invoke a browser (IE) as a COM control within your Java process, navigate through page elements by using their document ids etc., fill in forms and press buttons. Because it's running a browser, Javascript will run as normal (like if you were doing this manually).
You would probably need to write a Java Applet, as the only other way than sending a direct request would be to have it interface with the browser.
Of course, for this to work, you would have to embed the applet in the page. If you don't control the page, this can't be done. If you do control the page, you might as well be using Javascript, instead of trying to get a Java Applet to do it, which would be much more cumbersome and difficult.
Just to clarify, what is the problem you are having creating an HTTP Request and why do you want to use a different method?