I am trying to figure out how to submit a form using Jsoup.
On Xfinity's website, I am trying to input an address and get back the resulting page after clicking on "Show me deals" from the url below:
https://www.xfinity.com/learn/offers
Here is my current code:
public String getISP() throws IOException {
Connection.Response addressFormResponse = Jsoup.connect("https://www.xfinity.com/learn/offers")
.data("Address.SingleStreetAddress", address)
.method(Connection.Method.POST)
.execute();
Document doc = addressFormResponse.parse();
System.out.println(doc.title());
System.out.println(doc.location());
if (doc.location().contains("Active Address")) {
return "Comcast XFinity";
}
return "Cannot find an ISP";
}
The current code only returns the same webpage, how would I get back the resulting page?
Jsoup is a HTML parser library, it provides functionality for extracting and manipulating data on HTML page. If you need traverse websites, submit forms, click elements, it's better to use another tools, like selenium, HTTP client (which are often used for automated test of web applications) or web crawler libraries like crawler4j.
I would tend to disagree with Daniil's answer in that neither HTTP Client or crawler4j support javascript which is required for this page. Selenium is probably the best solution.
What follows is an example of how to use jsoup to fetch a page, fill out a form, and submit it. The result is json and so you would then pass that string to gson or similar. I did not that the page was very flaky just in a regular browser, and sometimes would catch the address input and sometimes would barf on the same input.
Document doc = Jsoup.connect("https://www.xfinity.com/learn/offers").get();
FormElement form = (FormElement) doc.selectFirst("[data-form-dealfinder-localization]");
Element input = form.selectFirst("#Address_StreetAddress");
input.val("2000 YALE AVE E, SEATTLE, WA 98102");
String json = form.submit().ignoreContentType(true).execute().body();
System.out.println(json);
Related
I'm new to using jsoup and I am struggling to retrieve the tables with class name: verbtense with the headers: Present and Past, under the div named Indicative from the from this site: https://www.verbix.com/webverbix/Swedish/misslyckas
I have started off trying to do the following, but there are no results from the get go:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty
I also tried this, but again no results:
Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements divs = document.select("div");
if (!divs.isEmpty()) {
for (Element div : divs) {
// all of these are empty
Elements verbTenses = div.getElementsByClass("verbtense");
Elements verbTables = div.getElementsByClass("verbtable");
Elements tables = div.getElementsByClass("table verbtable");
}
}
What am I doing incorrectly?
The page you are trying to scrape have dynamically generated content on the client side (with javascript), therfore you won be able to extact data using that link
You might me able to scrape some content from the API call that this webpage is making eg https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
Inspect browser console to see what page is doing, and do the same
The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.
Jsoup can't parse and execute JavaScript so all you get is the initial page :(
The next step would be to check what the browser is doing and what is the source of this additional content. You can check it using Chrome's debugger (Ctrl + Shift + i). If you open Network tab, select only XHR communication and refresh the page you can see two requests:
One of them gets such content https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas
as you can see it's a JSON with HTML fragments and this content seems to have verbs forms you need. But here's another catch because unfortunately Jsoup can't parse JSON :( So you'll have to use another library to get the HTML fragment and then you can parse it using Jsoup.
General advice to download JSON is to ignore content type (Jsoup will complain it doesn't support JSON):
String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();
then
you'll have to use some JSON parsing library for example json-simple
to obtain html fragment and then you can parse it to HTML with Jsoup:
String json = Jsoup.connect(
"https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
.ignoreContentType(true).execute().body();
System.out.println(json);
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);
Now you can try your initial approach with using selectors to get what you want from document object.
I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate these type of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not Jsoup work! Jsoup is a parser with a nice DOM API that allows you to deal with wild HTML as if it were well-formed and not crippled with errors and nonsenses.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
private void scrape(String url) {
Document doc = Jsoup.connect(url).get();
// Analyze current document content here...
// Then continue
for (Element link : doc.select(".ctl00_MainContent_btnNext")) {
scrape(link.attr("href"));
}
}
But in the general case what you want to do requires far more functionality that Jsoup provides: a user agent capable of interpreting HTML, CSS and Javascript with a scriptable API that you can call from your app to simulate a click. For example Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get - again, make your code behave exactly the same as your browser
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There are will be more data parameters to get from each response. But a lot depends on the site you are targeting.
Hope this helps a little.
I'm trying to using the jsoup library to get 'li' from a website. The problem is this:
If I open the source of website with CTRL+U(which is the same read by jsoup), the 'ul' tag is hidden.
if I open the code with the fuction "inspect code" of google chrome,'li' are shown.
Posting the code is not necessary; I only want to know how can access to this 'li' with jsoup or other java free libraries, Whereas in the source code(and through jsoup) these informations are hidden.
The site is https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco and try to search something(i.e. Tachi)
The problem with Jsoup is that it won't handle scripts. It is just getting html as it is before the AJAX code is executed.
You can use something like HtmlUnit, which is basically a GUI-less browser. So, it can handle scripts.
You can try something like this after getting the HtmlUnit library:
String url = "https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco?search=Tachi";
try(final WebClient webClient = new WebClient()) {
final HtmlPage page = webClient.getPage(url);
final HtmlUnorderedList list = page.getHtmlElementById("ul_farm_results");
System.out.println(list.asText());
}
I couldn't check the code as the website's certificate is improperly configured and I didn't want to import it's certificate. You may want to take a look at this to resolve the certificate errors.
JSoup does not execute all the scripts, it just gets the HTML returned by the server. What you are looking for is call rendered HTML, that is the HTML produced by the browser after executing all the scripts.
The best solution in Java is to use Selenium with your preferred browser. Selenium was developed for UI testing, it is however very popular as a scraping tool.
A good getting started page is to be found here.
Some code example with Firefox:
WebDriver driver = new FirefoxDriver();
driver.get("https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco");
// Find the element
String id = "ul_farm_results";
WebElement element = driver.findElement(By.id(id));
I'm trying to get some values from a site but these values only appears when I use a Browser, like Mozilla. When I use the Jsoup I can get the HTML from the site but without values, only with the tags.
This is the site I'm trying to parse:
http://www.submarinoviagens.com.br/Passagens/selecionarvoo?Origem=nat&Destino=mia&Data=05/11/2012&Hora=&Origem=mia&Destino=nat&Data=09/11/2012&Hora=&NumADT=1&NumCHD=0&NumINF=0&SomenteDireto=0&Cia=&SelCabin=&utm_source=&utm_medium=&utm_campaign=&CPId=
I'm trying to get the values that appears inside these span tags:
If I access the previous URL from a web browser I can see the following values: '', 'R$ 2634,22' and 'R$ 2634,22', but when I use the following code the values disapears.
URL url = new URL("http://www.submarinoviagens.com.br/Passagens/selecionarvoo?Origem=nat&Destino=mia&Data=05/11/2012&Hora=&Origem=mia&Destino=nat"+
"&Data=09/11/2012&Hora=&NumADT=1&NumCHD=0&NumINF=0&SomenteDireto=0&Cia=&SelCabin=&utm_source=&utm_medium=&utm_campaign=&CPId=");
Document doc = Jsoup.parse(url, 100000);
String title = doc.title();
System.out.println(doc.toString());
If I try to see the source code via Mozilla Firefox the values disapears too.
But If I use the firebug plugin I can see them.
Thank's for the help!
The website uses JavaScript to populate all of the values you are trying to parse. You will have to use a library that can compute the javascript within the page. Not sure if there is one though.
anyone else?
Htmlunit is a headless browser that renders Javascript and should be able to present this page correctly.
I am using jsoup to parse a number of things.
I am trying to parse this tag
<pre>HEllo Worl<pre>
But just cant get it to work.
How would i parse this using jsoup?\
Document jsDoc = null;
jsDoc = Jsoup.connect(url).get();
Elements titleElements = jsDoc.getElementsByTag("pre");
Here is what i have so far.
Works fine for me with latest Jsoup:
String html = "<p>lorem ipsum</p><pre>Hello World</pre><p>dolor sit amet</p>";
Document document = Jsoup.parse(html);
Elements pres = document.select("pre");
for (Element pre : pres) {
System.out.println(pre.text());
}
Result:
Hello World
If you get nothing, then the HTML which you're parsing simply doesn't contain any <pre> element. Check it yourself by
System.out.println(document.html());
Perhaps the URL is wrong. Perhaps there's some JavaScript which alters the HTML DOM with new elements (Jsoup doesn't interpret nor execute JS). Perhaps the site expects a real browser instead of a bot (change the user agent then). Perhaps the site requires a login (you'd need to maintain cookies). Who knows. You can figure this all out with a real webbrowser like Firefox or Chrome.