I'm trying to get the text box with id u_0_1e from the page wall, but HtmlUnit does not find it: the last line prints null.
Here's the code:
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
JavaScriptEngine engine = new JavaScriptEngine(client);
client.setJavaScriptEngine(engine);
HtmlPage home = client.getPage("https://www.facebook.com/login.php");
HtmlSubmitInput login = (HtmlSubmitInput) home.getElementById("u_0_1");
HtmlTextInput name = (HtmlTextInput) home.getElementById("email");
HtmlPasswordInput pass = (HtmlPasswordInput) home.getElementById("pass");
name.setValueAttribute("myname");
pass.setValueAttribute("mypass");
HtmlPage page = login.click();
HtmlPage wall = client.getPage("https://www.facebook.com/");
System.out.println(wall.getElementById("u_0_1e"));
I have some comments about your issue.
First of all, you have disabled HtmlUnit's logging. So if you have any JavaScript issue then you are not going to see it. If you are actually getting a JavaScript error then the JavaScript code won't be fully executed. If the element you're trying to fetch was dynamically fetched from the server (probably using AJAX) then the JavaScript errors, if any, might result in that element not being fetched.
If you are web scraping, which is clearly the case, then you have no control over the JS, so you can only accept that it does not work, or disable JS and process the AJAX requests manually.
Of course, you will see the page working perfectly in a real browser, but keep in mind that the JavaScript engine HtmlUnit uses is different from the ones in real browsers.
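Rather than switching logging off entirely, one option is to keep warnings and errors visible and hide only the chatter. A minimal sketch, assuming HtmlUnit's messages end up in java.util.logging (as the Level.OFF line in the question implies); the class name HtmlUnitLogging is hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class HtmlUnitLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage-collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware");

    /** Keeps WARNING and SEVERE messages visible and hides lower levels. */
    static Logger keepWarnings() {
        HTMLUNIT_LOGGER.setLevel(Level.WARNING);
        return HTMLUNIT_LOGGER;
    }
}
```

Calling HtmlUnitLogging.keepWarnings() once before creating the WebClient would replace the Level.OFF line.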
Secondly, the two lines containing the word engine are unnecessary: WebClient sets up its own JavaScript engine by default.
Thirdly, as I mentioned in a previous question of yours, this will be more suitable to be handled by means of the Facebook API.
Finally, you might find this other answer useful:
JavaScript not being properly executed in HtmlUnit
Related
Background
I am using HtmlUnit to simulate user behavior on a certain page.
I reach a login page where I need to enter the user credentials.
Issue:
The form I am supposed to fill in changes dynamically: it adds new input fields whose values change with each character typed.
The input field has several event listeners. As far as I could tell from debugging in Chrome, the keypress event is the most relevant one, since it is what ultimately generates the updated value.
I am getting the following errors when the page "loads":
[User1st] An error occurred while extracting lang code TypeError: Cannot call method "getAttribute" of undefined
4.c.g.h.javascript.StrictErrorReporter : runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: *:x).] sourceName= https://???/jquery-1.10.2.min.js] line=[3] lineSource=[null] lineOffset=[0]
Some code:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScript(5000);
final HtmlPage page = webClient.getPage(WEBSITE_URL);
HtmlForm loginForm = page.getFormByName("login");
HtmlTextInput userIdField = loginForm.getInputByName("USERID");
HtmlPasswordInput passwordField = loginForm.getInputByName("USERPASSWORD");
userIdField.type("ID");
passwordField.setText("PASSWORD");
Next, I simply iterate over the form's input fields and inspect their values.
How can I make sure that all the related JS code, if any, really gets executed?
I'm not sure if this helps, but simply letting the thread sleep worked for me. It probably gives the JS scripts time to load and run.
Thread.sleep(2000);
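A fixed sleep either waits longer than needed or not long enough. A minimal sketch of the alternative is to poll until a condition holds or a timeout expires. The helper name waitUntil and the Waiter class are hypothetical; in practice the condition might check that the expected input field has appeared on the page.

```java
import java.util.function.BooleanSupplier;

final class Waiter {
    /**
     * Polls the condition every pollMillis until it returns true or
     * timeoutMillis has elapsed. Returns whether the condition was met.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }
}
```

Note also that the question calls waitForBackgroundJavaScript before getPage. At that point there are presumably no background jobs to wait for yet, so calling it after the page has loaded is likely what was intended.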
I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it, but I'm sure people are able to navigate these types of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not a job for Jsoup! Jsoup is a parser with a nice DOM API that lets you deal with wild HTML as if it were well-formed rather than crippled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
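private void scrape(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Analyze current document content here...
    // Then continue. The question looks the button up by id, so select with '#', not '.'
    for (Element link : doc.select("#ctl00_MainContent_btnNext")) {
        // Assumes the element carries an href; absUrl resolves it against the page URL
        String next = link.absUrl("href");
        if (!next.isEmpty()) {
            scrape(next);
        }
    }
}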
But in the general case, what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example, Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare it with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get. Again, make your code behave exactly the same as your browser.
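Assembling the POST body might look like the sketch below. It assumes the form fields have already been read from the page into a map. The PostbackBody class and its parameters are hypothetical, and which fields to drop or override depends entirely on what your browser sends.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

final class PostbackBody {
    /**
     * Copies the fields read from the page, drops the unwanted ones,
     * applies the overrides, and returns a form-encoded body.
     */
    static String build(Map<String, String> pageFields,
                        Set<String> toRemove,
                        Map<String, String> overrides) {
        Map<String, String> data = new LinkedHashMap<>(pageFields);
        data.keySet().removeAll(toRemove);
        data.putAll(overrides);
        return data.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
    }

    // Base64 view state can contain '+', '/' and '=', so it must be URL-encoded.
    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

Keeping the fields in a LinkedHashMap preserves the order in which they appeared on the page, which makes the result easier to compare with the browser's request.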
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
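substringBetween is not defined in this answer; it may refer to a utility such as Apache Commons Lang's StringUtils.substringBetween. Under that assumption, a minimal stand-in using only JDK calls could be (the Text class name is hypothetical):

```java
final class Text {
    /**
     * Returns the text between the first occurrence of open and the next
     * occurrence of close, or null if either delimiter is missing.
     */
    static String substringBetween(String s, String open, String close) {
        int start = s.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = s.indexOf(close, start);
        return end < 0 ? null : s.substring(start, end);
    }
}
```

Because the search key ends with a pipe, "__VIEWSTATE|" does not accidentally match the "__VIEWSTATEGENERATOR|" entry.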
There will be more data parameters to get from each response, but a lot depends on the site you are targeting.
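The sample response above appears to follow a length|type|id|content| layout, where the leading number is the length of the content. Assuming that layout holds for your site (worth checking against real responses), all hidden fields can be collected in one pass instead of one substring search per field. The DeltaParser class is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DeltaParser {
    /** Returns id -> content for every "hiddenField" entry in the response. */
    static Map<String, String> hiddenFields(String response) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        while (pos < response.length()) {
            // Locate the three pipes that end the length, type and id parts.
            int p1 = response.indexOf('|', pos);
            int p2 = p1 < 0 ? -1 : response.indexOf('|', p1 + 1);
            int p3 = p2 < 0 ? -1 : response.indexOf('|', p2 + 1);
            if (p3 < 0) {
                break; // incomplete entry: stop parsing
            }
            int length;
            try {
                length = Integer.parseInt(response.substring(pos, p1).trim());
            } catch (NumberFormatException e) {
                break; // not the assumed layout
            }
            int contentStart = p3 + 1;
            int contentEnd = contentStart + length;
            if (length < 0 || contentEnd > response.length()) {
                break;
            }
            String type = response.substring(p1 + 1, p2);
            String id = response.substring(p2 + 1, p3);
            if ("hiddenField".equals(type)) {
                fields.put(id, response.substring(contentStart, contentEnd));
            }
            pos = contentEnd + 1; // skip the pipe that closes the content
        }
        return fields;
    }
}
```

Reading the content by its declared length, rather than searching for the next pipe, matters for entries such as update panels, whose HTML may itself contain pipe characters.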
Hope this helps a little.
Here is my code to get the page:
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url);
The problem is that the WebClient always executes JavaScript automatically and throws a list of errors. I just want to get the raw source. How can I prevent it from executing scripts? I found a way that works in version 2.9:
webClient.setJavaScriptEnabled(false);
But the setJavaScriptEnabled() function is deprecated. Does anyone know how to solve this problem? Please help me. Thank you so much.
Although setJavaScriptEnabled(boolean) was deprecated on WebClient, it was added to WebClientOptions, which you reach through the WebClient's getOptions() method. Here is the doc.
In order to disable JavaScript you should do this:
webClient.getOptions().setJavaScriptEnabled(false);
Additionally, if you want to get the raw HTML code from the web page, you should take a look at this question:
How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
Take into account that even the asXml() method changes the formatting as well as the content of the web page you fetch (even if JavaScript is disabled).
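If only the unmodified source is needed, one alternative that sidesteps HtmlUnit entirely is to fetch the response body with the JDK's own HTTP client (Java 11+). This is a different technique from the one in this answer, sketched here with a hypothetical RawSource class. It executes no JavaScript and does no DOM parsing at all.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RawSource {
    /** Builds a plain GET request for the given URL. */
    static HttpRequest request(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    /** Returns the response body as text, without any DOM processing. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        return client.send(request(url), HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

This only fits pages that do not need a session built up through HtmlUnit first; cookies and logins would have to be handled separately.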
I am writing a Java application that has a log in screen. Ideally, I would like to take the user supplied data (name, password), and submit it to an ASP form that can verify their credentials. I do not own the ASP form, I can only access the URL. I also do not want the user to be entering their credentials straight into the web form. They would enter their credentials into my program, and my program would put the data into the form and submit, and allow/deny the user based on the response.
Of course, the submit button on the ASP form is a POST request. However, constructing the URL (...login?username=name&password=pass) does not work, as the form must be submitted via the button with the text boxes filled in.
I have tried two approaches:
Using Java's URLConnection class. This does not seem to work because the form submitting is limited to the method I mentioned above, which is constructing the URL.
Using Javascript to access and edit the elements on the page. This has not worked either, because the Javascript is being run from my program, which is not a web browser, and therefore has no access to the 'document' or 'window' commonly used.
Other potential solutions I can think of:
Opening a browser to the login page but not giving it focus, running a script to fill out and submit the form, parsing the response, and then closing the browser. This would not involve the user at all, except for the input into the login page in my Java program.
Using a 3rd party Java library (suggestions? references to tutorials?).
Embedding the URL into my login screen (any help in this regard would be appreciated).
The things that cannot be changed are that my program is in Java, and that the login URL is an ASP form that hides the POST data from the URL.
Let me know if anything needs clarification. Any help is welcome.
Try HtmlUnit. Although it was designed for testing, it would be ideal for this. You can use it in conjunction with Selenium WebDriver.
Why don't you open your ASP form in an iframe, use JavaScript to populate all the fields, and then post it?
This should solve your problem.
Sfk is correct. I had a similar problem and managed to fill in the form and submit it with HtmlUnit. @sfk many thanks, you put me on the right path.
So, with HtmlUnit:
//for chrome simulation
WebClient webClient = new WebClient(BrowserVersion.CHROME_16);
// JavaScript is disabled because http://www.google-analytics.com/ga.js raised an error with it on
webClient.setJavaScriptEnabled(false);
HtmlPage page = webClient.getPage("http://yourtargetpage/Default.aspx");
//get the form by name, check page source for name
HtmlForm form = page.getFormByName("aspnetForm");
HtmlPasswordInput inputPass = form.getInputByName("your input password text field name");
HtmlTextInput userName = form.getInputByName("your input user text field name");
HtmlSubmitInput button=form.getInputByName("your target submit button");
//set username and password
userName.setText("myuser");
inputPass.setText("mypassword");
//click the submit button and get the returned page
HtmlPage page2 = button.click();
That's it. You have sent the field values and received the reply page, so you can now parse it to get the site's response.
I want to fill in a text field of an HTTP form through Java and then click the submit button, also through Java, so as to get the page source of the document returned after submitting the form.
I can do this by sending an HTTP request directly, but I don't want to do it that way.
I usually do it using HtmlUnit. They have an example on their page :
@Test
public void submittingForm() throws Exception {
final WebClient webClient = new WebClient();
// Get the first page
final HtmlPage page1 = webClient.getPage("http://some_url");
// Get the form that we are dealing with and within that form,
// find the submit button and the field that we want to change.
final HtmlForm form = page1.getFormByName("myform");
final HtmlSubmitInput button = form.getInputByName("submitbutton");
final HtmlTextInput textField = form.getInputByName("userid");
// Change the value of the text field
textField.setValueAttribute("root");
// Now submit the form by clicking the button and get back the second page.
final HtmlPage page2 = button.click();
}
And you can read more here.
If you don't want to talk HTTP directly (why?), then take a look at Watij.
It allows you to invoke a browser (IE) as a COM control within your Java process, navigate through page elements by using their document ids etc., fill in forms and press buttons. Because it's running a browser, JavaScript will run as normal (as if you were doing this manually).
You would probably need to write a Java Applet, as the only other way than sending a direct request would be to have it interface with the browser.
Of course, for this to work, you would have to embed the applet in the page. If you don't control the page, this can't be done. If you do control the page, you might as well be using Javascript, instead of trying to get a Java Applet to do it, which would be much more cumbersome and difficult.
Just to clarify, what is the problem you are having creating an HTTP Request and why do you want to use a different method?
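For reference, sending the form directly is fairly short with the JDK's HTTP client (Java 11+). A sketch, assuming the form posts a single userid field as in the HtmlUnit example above; the DirectPost class and the URL passed in are hypothetical, and real forms often need extra hidden fields or cookies.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class DirectPost {
    /** Form-encodes the single field this sketch assumes the form expects. */
    static String body(String userId) {
        return "userid=" + URLEncoder.encode(userId, StandardCharsets.UTF_8);
    }

    /** Builds the POST request a browser would send when the form is submitted. */
    static HttpRequest request(String formUrl, String userId) {
        return HttpRequest.newBuilder(URI.create(formUrl))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body(userId)))
                .build();
    }

    /** Sends the request and returns the page source of the response. */
    static String submit(String formUrl, String userId)
            throws IOException, InterruptedException {
        return HttpClient.newHttpClient()
                .send(request(formUrl, userId), HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```

The URL to post to is the form's action attribute, not necessarily the address of the page that displays the form.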