Can you use Jsoup to submit a search to Google, but instead of sending your request via "Google Search" use "I'm Feeling Lucky"? I would like to capture the name of the site that would be returned.
I see lots of examples of submitting forms, but never a way to specify a specific button to perform the search or form submission.
If Jsoup won't work, what would?
According to the HTML source of http://google.com, the "I'm Feeling Lucky" button has a name of btnI:
<input value="I'm Feeling Lucky" name="btnI" type="submit" onclick="..." />
So, just adding the btnI parameter to the query string should do (the value doesn't matter):
http://www.google.com/search?hl=en&btnI=1&q=your+search+term
So, this Jsoup should do:
String url = "http://www.google.com/search?hl=en&btnI=1&q=balusc";
Document document = Jsoup.connect(url).get();
System.out.println(document.title());
However, this gave a 403 (Forbidden) error.
Exception in thread "main" java.io.IOException: 403 error loading URL http://www.google.com/search?hl=en&btnI=1&q=balusc
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:387)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
at test.Test.main(Test.java:17)
Perhaps Google was sniffing the user agent and discovering it to be Java. So, I changed it:
String url = "http://www.google.com/search?hl=en&btnI=1&q=balusc";
Document document = Jsoup.connect(url).userAgent("Mozilla").get();
System.out.println(document.title());
This yields (as expected):
The BalusC Code
The 403 is, however, an indication that Google isn't necessarily happy with bots like this. You might get (temporarily) IP-banned if you do this too often.
I'd try HtmlUnit for navigating through a site, and Jsoup for scraping.
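For example, a minimal sketch of that split, assuming a recent HtmlUnit (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HybridExample {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        try {
            // HtmlUnit handles the navigation (and any JavaScript)...
            HtmlPage page = webClient.getPage("https://example.com/");
            // ...and Jsoup handles the scraping of the rendered HTML.
            Document doc = Jsoup.parse(page.asXml());
            System.out.println(doc.title());
        } finally {
            webClient.close();
        }
    }
}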
Yes, it can, if you are able to figure out how Google search queries are made. But Google doesn't allow this, even if you were to succeed. You should use their official API to make automated search queries:
http://code.google.com/intl/en-US/apis/customsearch/v1/overview.html
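For completeness, a rough sketch of calling the Custom Search JSON API over plain HTTP; the key and engine id are placeholders you get from the API console, and the response is raw JSON you would parse with Gson or similar:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class CustomSearchExample {
    public static void main(String[] args) throws Exception {
        String key = "YOUR_API_KEY";  // placeholder: from the API console
        String cx = "YOUR_ENGINE_ID"; // placeholder: your custom search engine id
        String q = URLEncoder.encode("balusc", "UTF-8");
        URL url = new URL("https://www.googleapis.com/customsearch/v1?key=" + key + "&cx=" + cx + "&q=" + q);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // raw JSON result
            }
        }
    }
}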
Related
I am trying to figure out how to submit a form using Jsoup.
On Xfinity's website, I am trying to input an address and get back the resulting page after clicking on "Show me deals" from the URL below:
https://www.xfinity.com/learn/offers
Here is my current code:
public String getISP() throws IOException {
    Connection.Response addressFormResponse = Jsoup.connect("https://www.xfinity.com/learn/offers")
            .data("Address.SingleStreetAddress", address)
            .method(Connection.Method.POST)
            .execute();

    Document doc = addressFormResponse.parse();
    System.out.println(doc.title());
    System.out.println(doc.location());

    if (doc.location().contains("Active Address")) {
        return "Comcast XFinity";
    }
    return "Cannot find an ISP";
}
The current code only returns the same webpage, how would I get back the resulting page?
Jsoup is an HTML parser library; it provides functionality for extracting and manipulating data in an HTML page. If you need to traverse websites, submit forms, or click elements, it's better to use other tools, like Selenium, an HTTP client (often used for automated testing of web applications), or web crawler libraries like crawler4j.
I would tend to disagree with Daniil's answer in that neither an HTTP client nor crawler4j supports JavaScript, which is required for this page. Selenium is probably the best solution.
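If you do go the Selenium route, a bare-bones sketch might look like the following; the submit-button locator is an assumption, so verify it against the live page in dev tools:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class XfinitySeleniumExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.xfinity.com/learn/offers");
            // The address input id is taken from the page; the button locator is a guess.
            driver.findElement(By.id("Address_StreetAddress")).sendKeys("2000 YALE AVE E, SEATTLE, WA 98102");
            driver.findElement(By.cssSelector("button[type=submit]")).click();
            System.out.println(driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}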
What follows is an example of how to use Jsoup to fetch a page, fill out a form, and submit it. The result is JSON, so you would then pass that string to Gson or similar. I did note that the page was very flaky even in a regular browser: sometimes it would accept the address input, and sometimes it would barf on the same input.
// Fetch the page and locate the deal-finder form
Document doc = Jsoup.connect("https://www.xfinity.com/learn/offers").get();
FormElement form = (FormElement) doc.selectFirst("[data-form-dealfinder-localization]");

// Fill in the address field and submit the form; the response body is JSON
Element input = form.selectFirst("#Address_StreetAddress");
input.val("2000 YALE AVE E, SEATTLE, WA 98102");
String json = form.submit().ignoreContentType(true).execute().body();
System.out.println(json);
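From there, recent Gson versions (2.8.6+) can parse the body in one line; this assumes the response really is a JSON object:

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// 'json' is the string returned by the snippet above
JsonObject root = JsonParser.parseString(json).getAsJsonObject();
System.out.println(root.keySet()); // explore the structure, then drill down to the fields you need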
I'm trying to go to the next page on an ASPX form using Jsoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
    // click? I don't know what to do here.
    searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with Jsoup? The documentation says nothing about it, but I'm sure people are able to navigate these types of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not Jsoup work! Jsoup is a parser with a nice DOM API that allows you to deal with wild HTML as if it were well-formed and not crippled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
private void scrape(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Analyze current document content here...

    // Then continue by following links (note: the button in the question is an id, not a class)
    for (Element link : doc.select("#ctl00_MainContent_btnNext")) {
        scrape(link.attr("href"));
    }
}
But in the general case, what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example, Selenium:
WebDriver driver = new FirefoxDriver();
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here, but it uses the univocity HTML parser (which is commercial closed source) instead of Jsoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each of these and compare them with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get. Again, make your code behave exactly the same as your browser.
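As a sketch of the mechanics with plain Jsoup (the __EVENTTARGET value is an assumption derived from the ctl00_MainContent_btnNext id in the question; ASP.NET control names usually use '$' where ids use '_'):

import java.util.HashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AspxPagingExample {
    public static void main(String[] args) throws Exception {
        String url = "https://www.somePage.com/someForm.aspx"; // placeholder
        Connection.Response first = Jsoup.connect(url).execute();
        Document doc = first.parse();

        // Copy every hidden field (__VIEWSTATE, __VIEWSTATEGENERATOR, ...) into the POST data.
        Map<String, String> data = new HashMap<>();
        for (Element input : doc.select("form#aspnetForm input[name]")) {
            data.put(input.attr("name"), input.val());
        }
        // Tell ASP.NET which control "caused" the postback; this value is an assumption.
        data.put("__EVENTTARGET", "ctl00$MainContent$btnNext");
        data.put("__EVENTARGUMENT", "");

        // POST back with the same session cookies to get the next page.
        Document nextPage = Jsoup.connect(url)
                .data(data)
                .cookies(first.cookies())
                .post();
        System.out.println(nextPage.title());
    }
}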
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There will be more data parameters to get from each response. But a lot depends on the site you are targeting.
Hope this helps a little.
[Image explaining the data to be extracted]
I'm trying to extract data from a web page (marked red in the image) using HtmlUnit library of java. But I can't get that particular value.
WebClient webClient = new WebClient(BrowserVersion.CHROME);
Thread.sleep(5000);
HtmlPage page = webClient.getPage("https://earth.nullschool.net/#current/wind/isobaric/500hPa/orthographic=-283.71,14.19,2183/loc=76.850,11.440");
Thread.sleep(5000);
System.out.println(page.asXml());
I checked the html which I got on console window. It doesn't contain the value.
<p>
  <span id="location-wind" class="location"></span>
  <span id="location-wind-units" class="location text-button"></span>
</p>
It's because these are filled in via JavaScript. When you load the page, these fields are initially empty. You can check this by looking at the source code and searching for id="location.
The page makes two additional HTTP requests to fetch dynamic data:
https://earth.nullschool.net/data/earth-topo.json?v3
https://gaia.nullschool.net/data/gfs/current/current-wind-isobaric-500hPa-gfs-0.5.epak
Somewhere in this data (and combined they are around 1.2 MB) is the data that you're looking for. Your best bet is to use a tool (perhaps an online one) to convert the JSON to a Java object, or to study the JSON and write code to get the specific data that you're after.
That is, if that data is in the JSON at all, which I'm not convinced of. The EPAK file appears to be some sort of binary data with embedded JSON, but I couldn't figure out whether the data you want is in there.
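If you want to poke at the JSON yourself, Jsoup can fetch it as long as you tell it not to insist on HTML; a small sketch:

import org.jsoup.Jsoup;

public class NullschoolJsonExample {
    public static void main(String[] args) throws Exception {
        String json = Jsoup.connect("https://earth.nullschool.net/data/earth-topo.json?v3")
                .ignoreContentType(true) // the endpoint returns JSON, not HTML
                .userAgent("Mozilla")    // some servers reject the default Java agent
                .execute()
                .body();
        System.out.println(json.substring(0, Math.min(200, json.length()))); // peek at the payload
    }
}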
Another approach is to use Selenium, have it parse the page for you, and retrieve the data from there.
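A rough Selenium sketch, using the span ids from the markup above; the explicit wait gives the page's JavaScript time to fill in the value:

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public class NullschoolSeleniumExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://earth.nullschool.net/#current/wind/isobaric/500hPa/orthographic=-283.71,14.19,2183/loc=76.850,11.440");
            // Wait until the page's JavaScript has written something into the span.
            new WebDriverWait(driver, Duration.ofSeconds(30))
                    .until(d -> !d.findElement(By.id("location-wind")).getText().isEmpty());
            System.out.println(driver.findElement(By.id("location-wind")).getText());
        } finally {
            driver.quit();
        }
    }
}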
I have a pet project I'm working on having to do with ESPN fantasy football. Anyhow, my league is private and requires that I log in to the site before I can navigate to the page. For instance, in the browser, when I go to
http://games.espn.go.com/ffl/standings?leagueId=491518&seasonId=2014
I get redirected to a login page. I'm trying to use Jsoup to log in and scrape some data off the webpage, but I can't get past the login issue. No matter what I try, I keep getting redirected.
I inspected the POST and GET requests and found some parameters in the form data in addition to username and password, like "submit", "multipleDomains", "cookieDomain", etc. I'm not sure if I need to set those or not; I tried, but it didn't work either way, or I did something wrong. While inspecting, I found the login address to be
https://r.espn.go.com/espn/memberservices/pc/login
So when I use that address I don't get redirected, but it does not return any cookies that I can use in subsequent requests to bypass the redirection.
I'm looking for some guidance, or to see if anyone has had success doing this. I've seen all the "jsoup login examples" and tried several of them, but none seem to work. Any help or pointers would be greatly appreciated. Maybe there's a better way or tool than Jsoup? I'm not hard set on Jsoup; it just seems to be pretty popular and stable.
I hope this helps someone out as well, but it turned out that I did miss one of the items in the form data :/ In case anyone is trying something similar with ESPN, the proper form data elements are:
failedAttempts
SUBMIT
failedLocation
aff_code
appRedirect
cookieDomain
multipleDomain
username
password
submit
All of these can be found using Chrome's developer tools by inspecting the login request headers; a sketch of the resulting request follows below.
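Put together with Jsoup, the login request might look roughly like the sketch below; the values for the affiliate/redirect/cookie fields are assumptions, so copy the real ones from your own captured request:

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class EspnLoginExample {
    public static void main(String[] args) throws Exception {
        Connection.Response login = Jsoup.connect("https://r.espn.go.com/espn/memberservices/pc/login")
                .data("username", "yourUser")       // placeholder
                .data("password", "yourPass")       // placeholder
                .data("SUBMIT", "1")                // assumption; copy from dev tools
                .data("submit", "Sign In")          // assumption; copy from dev tools
                .data("failedAttempts", "")
                .data("failedLocation", "")
                .data("aff_code", "espn_fantgames") // assumption; copy from your captured request
                .data("appRedirect", "http://games.espn.go.com/ffl/standings?leagueId=491518&seasonId=2014")
                .data("cookieDomain", ".go.com")    // assumption
                .data("multipleDomain", "true")     // assumption
                .method(Connection.Method.POST)
                .execute();

        // Reuse the session cookies for the protected page.
        Map<String, String> cookies = login.cookies();
        System.out.println(Jsoup.connect("http://games.espn.go.com/ffl/standings?leagueId=491518&seasonId=2014")
                .cookies(cookies)
                .get()
                .title());
    }
}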
I'm trying to fill out a form automatically, press a button on that form, and wait for a response. How do I go about doing this?
To be more particular, I have a --HUGE-- collection of DNA strains which I need to compare to each other. Luckily, there's a website that does exactly what I need.
Basically, I type in 2 different sequences of DNA, click the "Align Sequences" button, and get a result (the calculation of the score is not relevant).
Is there a way to make a Java program that will automatically insert the input, "click" the button and read the response from this website?
Thanks!
You can use the Apache HTTP client to send a request to a web site.
Look at the source of the page in question, and you'll find the <form> part. This contains all the fields that need to be sent to the server. In particular, you'll see that it needs to be sent as a POST, rather than the more common GET. The link above shows you how to do a POST with the HTTP client code.
You'll need to provide a nameValuePair for every field in the form, such as these ones:
<input type="hidden" name="rm" value="lalign_x"/>
<input type="checkbox" name="show_ident" value="1" />
<textarea name="query" rows="6" cols="60">
It will probably take some trial and error for you to get all the fields set up correctly. I'd recommend doing this with small data sets. Once it all seems to be working, then try it with your bigger data.
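As a hedged sketch with Apache HttpClient 4.x, posting the fields shown above (the endpoint URL is a placeholder for the form's action attribute):

import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class AlignFormExample {
    public static void main(String[] args) throws Exception {
        HttpPost post = new HttpPost("https://example.org/cgi-bin/lalign"); // placeholder: use the form's action URL

        // One pair per form field, mirroring the inputs shown above.
        List<NameValuePair> params = new ArrayList<>();
        params.add(new BasicNameValuePair("rm", "lalign_x"));
        params.add(new BasicNameValuePair("show_ident", "1"));
        params.add(new BasicNameValuePair("query", "ACGT...")); // your DNA sequence
        post.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));

        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(post)) {
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}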
In Python, you can use the mechanize library (http://wwwsearch.sourceforge.net/mechanize/). It's quite simple, and you don't need to know Python very well to use it.
Simple example (filling a login form):
from mechanize import Browser

br = Browser()
br.open(login_link)
br.select_form(name="login")
br["email"] = "email@server.com"
br["pass"] = "password"
br.submit()
You could probably do this using Selenium.
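A bare-bones Selenium sketch: the "query" field name is taken from the form fields above, but the URL and button locator are hypothetical, so adapt them to the actual page:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class AlignSeleniumExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.org/align"); // placeholder for the alignment site
            driver.findElement(By.name("query")).sendKeys("ACGT..."); // field name from the form above
            driver.findElement(By.xpath("//input[@value='Align Sequences']")).click(); // locator is a guess
            System.out.println(driver.getPageSource()); // the response page, ready to scrape
        } finally {
            driver.quit();
        }
    }
}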