Here is my code to get the page:
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url);
The problem is that the WebClient always executes JavaScript automatically and throws a list of errors at me. I just want to get the raw source. How can I prevent it from executing scripts? I've found that there was a way in version 2.9:
webClient.setJavaScriptEnabled(false);
But the setJavaScriptEnabled() method has since been deprecated. Does anyone know how to solve this problem? Please help me. Thank you so much.
Although setJavaScriptEnabled(boolean) was deprecated on WebClient itself, it was added to the WebClientOptions member of the WebClient. Here is the doc.
In order to disable JavaScript you should do this:
webClient.getOptions().setJavaScriptEnabled(false);
Additionally, if you want to get the raw HTML code of the webpage, you should take a look at this question:
How to get the pure raw HTML of a page in HTMLUnit while ignoring JavaScript and CSS?
Take into account that even the asXml() method changes the formatting as well as the content of the web page you fetch (even if JavaScript is disabled).
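If all you need is the untouched source, one option that avoids HtmlUnit's parsing altogether is a plain HTTP request. This is a swapped-in technique, not something from the answer above: a minimal sketch using the JDK's own java.net.http client (Java 11+), where the class name RawSource and the Accept header are my assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: fetches a page body exactly as the server sends it.
// No JavaScript is executed and no DOM is built.
final class RawSource {

    // Builds a plain GET request for the given URL.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .GET()
                .header("Accept", "text/html")
                .build();
    }

    // Sends the request and returns the response body as a string.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling RawSource.fetch(url) should give you the bytes the server returned, decoded as text. You lose HtmlUnit's cookie handling and form support this way, so it only fits the "just give me the source" case.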
I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any solution other than simulating a click would be fine too, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate this type of form.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not a job for Jsoup! Jsoup is a parser with a nice DOM API that lets you deal with wild HTML as if it were well-formed and not riddled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
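private void scrape(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Analyze current document content here...
    // Then continue. The button is identified by its id, so select it with '#'
    for (Element link : doc.select("#ctl00_MainContent_btnNext")) {
        // absUrl resolves a relative href against the page URL
        scrape(link.absUrl("href"));
    }
}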
But in the general case, what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example, Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare it with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get. Again, make your code behave exactly the same as your browser.
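To make the "generate a POST request" step a bit more concrete, here is a minimal sketch of turning the collected form fields into an application/x-www-form-urlencoded body using only the JDK (Java 10+). The class name FormBody is hypothetical, and the field values, including the control name put into __EVENTTARGET, are made-up placeholders; the real ones have to come from the page and from what your browser sends.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: encodes form fields the way a browser encodes a POST body.
final class FormBody {

    // Joins the URL-encoded name=value pairs with '&', keeping the map's order.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "="
                    + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Placeholder values only; copy the real ones from the page you are scraping.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTTARGET", "ctl00$MainContent$btnNext");
        fields.put("__EVENTARGUMENT", "");
        fields.put("__VIEWSTATE", "/wEPDwUKMTU0");
        System.out.println(encode(fields));
    }
}
```

If you stay with Jsoup, you probably don't need to encode by hand: the same map can be passed to Connection.data(...) before calling post(). The sketch is mainly useful for comparing your request body byte by byte with the one shown in the browser's network tab.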
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There will be more data parameters to get from each response. But a lot depends on the site you are targeting.
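The extraction step can be sketched with the JDK alone. The snippet above calls substringBetween without saying where it comes from, so the version below is a hand-written stand-in (the class name ViewStateExtractor is hypothetical), and the sample response is shortened, made-up data in the same pipe-delimited shape.

```java
// Hypothetical helper for pulling hidden-field values out of a pipe-delimited response.
final class ViewStateExtractor {

    // Returns the text between the first occurrence of 'open' and the next 'close',
    // or null if either marker is missing.
    static String substringBetween(String text, String open, String close) {
        int start = text.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        return end < 0 ? null : text.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up sample; a real response carries a much longer __VIEWSTATE value.
        String ajaxResponse = "0|hiddenField|__EVENTARGUMENT||11|hiddenField|__VIEWSTATE|"
                + "/wEPDwU1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A|";
        String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
        String generator = substringBetween(ajaxResponse, "__VIEWSTATEGENERATOR|", "|");
        System.out.println(viewState);
        System.out.println(generator);
    }
}
```

Note that the marker "__VIEWSTATE|" includes the trailing pipe, so it does not accidentally match "__VIEWSTATEGENERATOR|". This simple search assumes the value itself contains no pipe character.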
Hope this helps a little.
I'm trying to use the jsoup library to get the 'li' elements from a website. The problem is this:
If I open the source of the website with CTRL+U (which is the same source jsoup reads), the 'ul' tag is hidden.
If I open the code with the "Inspect" function of Google Chrome, the 'li' elements are shown.
Posting the code is not necessary; I only want to know how I can access these 'li' elements with jsoup or other free Java libraries, given that in the source code (and through jsoup) this information is hidden.
The site is https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco; try searching for something (e.g. Tachi).
The problem with Jsoup is that it won't handle scripts. It is just getting html as it is before the AJAX code is executed.
You can use something like HtmlUnit, which is basically a GUI-less browser. So, it can handle scripts.
You can try something like this after getting the HtmlUnit library:
String url = "https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco?search=Tachi";
try(final WebClient webClient = new WebClient()) {
final HtmlPage page = webClient.getPage(url);
final HtmlUnorderedList list = page.getHtmlElementById("ul_farm_results");
System.out.println(list.asText());
}
I couldn't check the code, as the website's certificate is improperly configured and I didn't want to import its certificate. You may want to take a look at this to resolve the certificate errors.
JSoup does not execute any scripts; it just gets the HTML returned by the server. What you are looking for is called rendered HTML, that is, the HTML produced by the browser after executing all the scripts.
The best solution in Java is to use Selenium with your preferred browser. Selenium was developed for UI testing, but it is also very popular as a scraping tool.
A good getting started page is to be found here.
Some code example with Firefox:
WebDriver driver = new FirefoxDriver();
driver.get("https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco");
// Find the element
String id = "ul_farm_results";
WebElement element = driver.findElement(By.id(id));
I'm trying to get the textbox with u_0_1e as id, from the page wall but HtmlUnit does not find anything. The last line prints null.
Here's the code:
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
WebClient client = new WebClient(BrowserVersion.CHROME);
JavaScriptEngine engine = new JavaScriptEngine(client);
client.setJavaScriptEngine(engine);
HtmlPage home = client.getPage("https://www.facebook.com/login.php");
HtmlSubmitInput login = (HtmlSubmitInput) home.getElementById("u_0_1");
HtmlTextInput name = (HtmlTextInput) home.getElementById("email");
HtmlPasswordInput pass = (HtmlPasswordInput) home.getElementById("pass");
name.setValueAttribute("myname");
pass.setValueAttribute("mypass");
HtmlPage page = login.click();
HtmlPage wall = client.getPage("https://www.facebook.com/");
System.out.println(wall.getElementById("u_0_1e"));
I have some comments about your issue.
First of all, you have disabled HtmlUnit's logging. So if you have any JavaScript issue then you are not going to see it. If you are actually getting a JavaScript error then the JavaScript code won't be fully executed. If the element you're trying to fetch was dynamically fetched from the server (probably using AJAX) then the JavaScript errors, if any, might result in that element not being fetched.
If you are web scraping, which is clearly the case, then you don't have any control over the JS, so you can only accept that it doesn't work, or disable JS and process the AJAX requests manually.
Of course, you will see the page working perfectly in a real browser, but take into consideration that the JavaScript engine HtmlUnit uses is different from the ones in real browsers.
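Rather than switching the logger off entirely, you could lower the noise but keep warnings and errors visible. A minimal sketch, assuming HtmlUnit's output is routed through java.util.logging under the "com.gargoylesoftware" name, as the question's code implies (newer releases may use a different package name); the class name HtmlUnitLogging is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: keeps HtmlUnit warnings (including JavaScript problems) visible.
final class HtmlUnitLogging {

    // java.util.logging holds loggers weakly, so keep a strong reference;
    // otherwise the configured level can be lost when the logger is collected.
    private static final Logger HTMLUNIT = Logger.getLogger("com.gargoylesoftware");

    // Shows WARNING and SEVERE messages and hides the more verbose levels.
    static Logger showWarnings() {
        HTMLUNIT.setLevel(Level.WARNING);
        return HTMLUNIT;
    }
}
```

Call HtmlUnitLogging.showWarnings() once before creating the WebClient. If a script error is what keeps the element from appearing, it should then show up in the output instead of being silently dropped.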
Secondly, the two lines containing the word engine are absolutely unneeded.
Thirdly, as I mentioned in a previous question of yours, this will be more suitable to be handled by means of the Facebook API.
Finally, you might find this other answer useful:
JavaScript not being properly executed in HtmlUnit
I am trying to access a servlet page that contains one image, using HtmlUnit.
I need to save the image, or save the servlet page as an HTML page.
Right now I am using
UnexpectedPage currentPage = (UnexpectedPage) webClient.getPage(new URL("https://www.xxxx.com/servlet/xxxSer"));
WebResponse response = currentPage.getWebResponse();
response.getContentType();
After that I do not know what to do. Does anyone have an idea how to do this?
Thanks in advance.
You need to get the text content of the WebResponse (you also don't need the URL object):
Page page = webClient.getPage("https://www.xxxx.com/servlet/xxxSer");
String content = page.getWebResponse().getContentAsString();
Regarding the image, you should be clearer about how you're getting it. If it is an image that is referenced in an IMG tag, then use an HtmlPage and an HtmlImage. If you're requesting the image directly, you should probably use page.getWebResponse().getContentAsStream().
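For the "requesting the image directly" case, writing the stream to disk needs nothing beyond the JDK. A minimal sketch, where the class name ResponseSaver and the target path are my assumptions; the InputStream would be the one returned by page.getWebResponse().getContentAsStream().

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies a response stream into a file.
final class ResponseSaver {

    // Writes all bytes from 'content' to 'target', replacing any existing file,
    // closes the stream, and returns the number of bytes written.
    static long save(InputStream content, Path target) {
        try (InputStream in = content) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Usage would look something like ResponseSaver.save(page.getWebResponse().getContentAsStream(), Paths.get("image.png")). Copying the stream keeps the bytes intact, whereas going through getContentAsString() would likely corrupt binary data such as an image.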
Try this code
HtmlPage htmlpage = webClient.getPage(new URL("https://www.xxxx.com/servlet/xxxSer"));
String htmlcode = htmlpage.getWebResponse().getContentAsString();
Best
The problem is that HtmlUnit is not able to cast incomplete HTML pages (ones with unclosed tags, for example). I was able to solve this error using HTMLParser, which is included in HtmlUnit's packages (I'm using version 2.36.0). HTMLParser completes the markup and handles this kind of casting error. HtmlPage works if you need to execute JS.
//Web client creation.
Page page = webClient.getPage(url);
HtmlPage tmpPage = HTMLParser.parseHtml(page.getWebResponse(), webClient.getCurrentWindow());
// use tmpPage here
Is there a way to make an AJAX call alter the current page URL without redirecting or reloading the page, in Apache Wicket?
For example, say we are in the url:
localhost:8080/someUrl
I'd like that when I click an ajax link, some action is performed, and the url changes to, say:
localhost:8080/otherUrl
without redirecting, just changing the url displayed in the browser.
Is this even possible?
Thanks!
Manuel
Actually, you can!
But this is not related to Wicket at all.
This is what the new History API in HTML5 is about.
Just search for "html5 History API example" in Google and enjoy.
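On the Java side, all Wicket would have to do is send a small script back with the AJAX response. A minimal sketch of building that script, where the class name PushStateScript is hypothetical and the escaping only covers backslashes and single quotes. My assumption is that the resulting string is handed to AjaxRequestTarget.appendJavaScript(...) inside the AJAX link's onClick handler.

```java
// Hypothetical helper: builds the JavaScript that changes the displayed URL
// through the HTML5 History API without reloading the page.
final class PushStateScript {

    // Returns a history.pushState call for the given URL, escaped for use
    // inside a single-quoted JavaScript string literal.
    static String forUrl(String url) {
        String escaped = url.replace("\\", "\\\\").replace("'", "\\'");
        return "history.pushState(null, '', '" + escaped + "');";
    }

    public static void main(String[] args) {
        System.out.println(forUrl("/otherUrl"));
    }
}
```

Keep in mind that pushState only accepts URLs on the same origin as the current page, and that the server still has to be able to answer the new URL if the user reloads or bookmarks it.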
The only part of the URL you can change with JavaScript is the hash.
You could change localhost:8080/#/someUrl to localhost:8080/#/otherUrl.
Do this with window.location.hash.
Here's an example of a Flash site which uses this concept to allow for deep-linking URLs: http://www.2advanced.com
Help make this feature happen, vote (or contribute!) for https://issues.apache.org/jira/browse/WICKET-5290
No, it isn't. If you change the location in the browser, a new request is made to that URL.
(You do that with window.location.href = newUrl, but the page reloads)