I am trying to collect hotel reviews from different websites.
For simple, plain HTML pages (like TripAdvisor) I used Jsoup like this:
Jsoup.connect("foo.html").get();
For pages that load their content with JavaScript (like Expedia) I used a Selenium WebDriver and did something like:
// Set the page-load timeout first so that it applies to the get() call below.
driver.manage().timeouts().pageLoadTimeout(10, TimeUnit.SECONDS);
driver.get("foo.html");
Those worked fine because the pages had links that I could follow to crawl and get more reviews.
The problem I face is downloading pages that make AJAX calls (like MakeMyTrip).
Here I do not know how to download the page, because the hotel list on it keeps loading as you scroll down.
Any suggestions would be of great help.
Solved it by requesting the URL that the AJAX call itself goes to.
Example:
For a hotel with ID 200703241029455940 (which comes from the main page), the reviews come from this URL:
http://hotelz.makemytrip.com/makemytrip/site/hotels/detail/responsive/hotelMmtReviews?hotelId=200703241029455940&start=10&rows=10&reviewsType=all
A GET request to that URL returns a JSON array of reviews, from which I could extract the hotel reviews.
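For anyone who wants to page through the reviews, here is a minimal sketch using only the JDK's built-in HTTP client (Java 11+). The class and method names are my own, and I am assuming that start is a zero-based offset and rows is the page size, which is how the example URL above reads; the endpoint may well behave differently or have changed.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper: names are illustrative, not part of any library.
final class ReviewFetcher {
    private static final String BASE =
        "http://hotelz.makemytrip.com/makemytrip/site/hotels/detail/responsive/hotelMmtReviews";

    // Builds the same URL shape as the example above.
    // Assumption: "start" is an offset and "rows" is the number of reviews per call.
    static String reviewsUrl(String hotelId, int start, int rows) {
        return BASE + "?hotelId=" + hotelId
            + "&start=" + start
            + "&rows=" + rows
            + "&reviewsType=all";
    }

    // Performs the GET and returns the raw response body (expected to be JSON).
    static String fetchPage(HttpClient client, String hotelId, int start, int rows)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(reviewsUrl(hotelId, start, rows)))
            .GET()
            .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Calling fetchPage(HttpClient.newHttpClient(), hotelId, 0, 10), then with start 10, 20 and so on until the returned array is empty, would walk through all reviews under those assumptions. The body still has to be handed to a JSON parser of your choice.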
I'm trying to go to the next page on an aspx form using JSoup.
I can find the next button itself. I just don't know what to do with it.
The idea is that, for that particular form, if the next button exists, we would simulate a click and go to the next page. But any other solution other than simulating a click would be fine, as long as we get to the next page.
I also need to update the results once we go to the next page.
// Connecting, entering the data and making the first request
...
// Submitting the form
Document searchResults = form.submit().cookies(resp.cookies()).post();
// reading the data. Everything up to this point works as expected
...
// finding the next button (this part also works as expected)
Element nextBtn = searchResults.getElementById("ctl00_MainContent_btnNext");
if (nextBtn != null) {
// click? I don't know what to do here.
searchResults = ??? // updating the search results to include the results from the second page
}
The page itself is www.somePage.com/someForm.aspx, so I can't use the solution stated here:
Android jsoup, how to select item and go to next page
I was unable to find any other suggestions.
Any ideas? What am I missing? Is simulating a click even possible with JSoup? The documentation says nothing about it. But I'm sure people are able to navigate these types of forms.
Also, I'm working with Android, so I can't use HtmlUnit, as stated here:
importing HtmlUnit to Android project
Thank you.
This is not a job for Jsoup! Jsoup is a parser with a nice DOM API that lets you deal with wild HTML as if it were well-formed and not riddled with errors and nonsense.
In your specific case you may be able to scrape the target site directly from your app by finding links and retrieving HTML pages recursively. Something like
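private void scrape(String url) throws IOException {
    Document doc = Jsoup.connect(url).get();
    // Analyze current document content here...
    // Then continue with the "next" link, if the page has one.
    // "#" selects by id; absUrl() resolves a relative href against the page URL.
    for (Element link : doc.select("#ctl00_MainContent_btnNext")) {
        scrape(link.absUrl("href"));
    }
}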
But in the general case what you want to do requires far more functionality than Jsoup provides: a user agent capable of interpreting HTML, CSS and JavaScript, with a scriptable API that you can call from your app to simulate a click. For example Selenium:
WebDriver driver = new FirefoxDriver();
driver.get(url); // load the page first; "url" stands for the form's address
driver.findElement(By.name("next_page")).click();
Selenium can't be bundled in an Android app, so I suggest you put your Selenium code on a server and make it accessible with some REST API.
Pagination on ASPX can be a pain. The best thing you can do is to use your browser to see the data parameters it sends to the server, then try to emulate this in code.
I've written a detailed tutorial on how to handle it here but it uses the univocity HTML parser (which is commercial closed source) instead of JSoup.
In short, you should try to get a <form> element with id="aspnetForm", and read the form elements to generate a POST request for the next page. The form data usually comes out with stuff such as this:
__EVENTTARGET =
__EVENTARGUMENT =
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
... and other gibberish
Then you need to look at each one of these and compare it with what your browser sends. Sometimes you need to get values from other elements of the page to generate a similar POST request. You may have to REMOVE some of the parameters you get - again, make your code behave exactly the same as your browser.
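As a rough illustration of the "generate a POST request" step, here is a sketch that turns the collected form fields into an application/x-www-form-urlencoded body using only the JDK. The FormBody class is a made-up helper, and the field values are placeholders. The point it shows is that __VIEWSTATE is Base64 text containing characters such as +, / and =, so it has to be URL-encoded before it goes into the body.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for building the POST body; not part of Jsoup or any parser.
final class FormBody {
    // Encodes each name and value and joins the pairs with '&',
    // keeping the order in which the fields were added.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTTARGET", "");
        fields.put("__EVENTARGUMENT", "");
        fields.put("__VIEWSTATE", "/wEP+a="); // placeholder, the real value is very long
        System.out.println(encode(fields));
    }
}
```

A LinkedHashMap is used so the fields go out in the order you add them, which makes the result easier to compare with what the browser's network tab shows.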
After some (frustrating) trial and error you will get it working. The server should return a pipe-delimited result, which you can break down and parse. Something like:
25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
<div style="font-weight: bold;">
... more stuff
|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... another very long string ...1Pni|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A| other gibberish
From THAT sort of response you need to generate new POST requests for the subsequent pages, for example:
String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");
Then:
request.setDataParameter("__VIEWSTATE", viewState);
There will be more data parameters to get from each response, but a lot depends on the site you are targeting.
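A plain substring search for "__VIEWSTATE|" can break if the HTML inside an update panel happens to contain a pipe. Judging from the sample response above, each record looks like length|type|id|content| where the number gives the length of the content, so a slightly more careful sketch reads the records by length and collects every hiddenField. That record layout is my assumption from the sample, and DeltaResponse is a made-up name; check it against what your server actually returns.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser, assuming records of the form: length|type|id|content|
final class DeltaResponse {
    // Returns the id -> value pairs of all "hiddenField" records.
    static Map<String, String> hiddenFields(String response) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        while (pos < response.length()) {
            int lenEnd = response.indexOf('|', pos);
            if (lenEnd < 0) break;
            int typeEnd = response.indexOf('|', lenEnd + 1);
            if (typeEnd < 0) break;
            int idEnd = response.indexOf('|', typeEnd + 1);
            if (idEnd < 0) break;

            int length = Integer.parseInt(response.substring(pos, lenEnd));
            String type = response.substring(lenEnd + 1, typeEnd);
            String id = response.substring(typeEnd + 1, idEnd);

            // The content is taken by length, so pipes inside it do no harm.
            int contentStart = idEnd + 1;
            int contentEnd = contentStart + length;
            if (contentEnd > response.length()) break;
            String content = response.substring(contentStart, contentEnd);

            if ("hiddenField".equals(type)) {
                fields.put(id, content);
            }
            pos = contentEnd + 1; // skip the '|' that closes the record
        }
        return fields;
    }
}
```

With this, hiddenFields(ajaxResponse).get("__VIEWSTATE") gives the value to send with the next POST, and the other hidden fields come along in the same map.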
Hope this helps a little.
The problem I'm having is the following:
I have an app with two separate modes: a WebView for browsing and a custom Canvas. The custom Canvas captures handwriting samples for language placement exams. Here's how it works. A user logs in to Moodle via the WebView. After they log in, they navigate to a Quiz inside Moodle. They click a link on one of the Quiz's questions, and this launches an Intent which hides the WebView and shows the Canvas. The user then writes (using a stylus) on the Canvas. When the user has finished writing their essay (or whatever), they press a button that uploads an image file to Moodle. I am able to upload images up to a point; the problem is getting them to show up in the HTML page where the user originally clicked the link (see above) and getting Moodle to commit them to permanent storage. Normally this is all accomplished through AJAX (really AJAJ, since it's JavaScript and JSON): when the user drops a file on this one component, the component refreshes and uploads the file.
Here is the problem: I need the WebView so that students can log in to Moodle through Shibboleth. But because the underlying JavaScript in the browser makes AJAX calls to the Moodle server, and since the Java side of Android doesn't have access to the DOM, I have to use the Apache HTTP components library to make some of the connections below, basically to preserve the state of the HTML page in the WebView.
In a desktop browser on, say, Windows, I use WebScarab to monitor the browser's requests and this is what I see: the browser uploads a file to Moodle via five successive calls to the following scripts:
1. POST https://[moodle website]/repository/repository_ajax.php [posts multipart form data]
2. POST https://[moodle website]/repository/draftfiles_ajax.php [posts some params]
3. GET https://[moodle website]/draftfile.php/[some_id]/user/draft/[some_id]/[somefilename.png] [returns an icon of the image for a filepicker from YUI]
4. POST https://[moodle website]/mod/quiz/processattempt.php [returns HTML page]
5. GET https://[moodle website]/mod/quiz/summary.php [returns HTML page]
Some of these scripts return, as you'd expect, JSON data since they're AJAX and not HTML. The final two calls (4 & 5) return HTML. Now, I can make all of those calls in succession in either the WebView or the Apache HTTP library, but if I do so with WebView, only JSON data is returned to the WebView in calls 1-3 (WebView treats the JSON data as a page and displays it wiping out whatever HTML page was displayed in it). If I capture and process the JSON data using the Apache HTTP library in Java, then the JavaScript components internal to the page do not get updated. If I split the calls so that I send only calls 4 & 5 to the WebView, the HTML merely returns WebView to the first question of the exam and Moodle acts as if I haven't uploaded anything.
I can verify that files are uploading if I manually refresh (press a link) the JavaScript UI elements in the page. I can't expect students to do this, though, because the link to do so is very tiny and it's not obvious that it does a refresh. I need a way to programmatically refresh this one element (it's part of YUI) or to get Android and the Java side to play more nicely with the JavaScript/DOM side.
My question is: does anyone know a way to 1) fire off a drag and drop event using YUI to an element inside an HTML page or 2) a way to consume the JSON data and pass it to an element inside the HTML page.
I'm banging my head against a wall trying to figure this out.
OK, so I figured out that javascript:document.getElementsByClassName(\"[name of link here]\")[0].click() works in Chrome on the desktop but doesn't work if I pass it to WebView.loadUrl(). I just need to be able to simulate that click event reliably in WebView. It appears not to support click(). Does anyone have any ideas?
The winning code is:
var el = document.getElementsByClassName("[some element]")[0];
var event = document.createEvent("MouseEvents");
event.initEvent("click", true, true);
el.dispatchEvent(event);
This selects the link at [some element] and dispatches a click on it, which fires the AJAX request that refreshes the FilePicker. For those working with Moodle: I had to add the above code to the same quiz question, put it in its own function, and invoke it with WebView.loadUrl("javascript:myRefreshFunction()").
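If you cannot add a function to the page itself, the same event dispatch can be sent as a single javascript: URL built on the Java side. The sketch below only builds the string; ClickScript is a made-up helper, and passing the result to WebView.loadUrl() is the part I am assuming works the same way as the named function above.

```java
// Hypothetical helper that wraps the event-dispatch code in one javascript: URL.
final class ClickScript {
    static String forClassName(String className) {
        // Escape backslashes and single quotes so the name is safe inside a JS string literal.
        String escaped = className.replace("\\", "\\\\").replace("'", "\\'");
        return "javascript:(function(){"
            + "var el=document.getElementsByClassName('" + escaped + "')[0];"
            + "if(!el)return;" // nothing to click if the element is not on the page
            + "var ev=document.createEvent('MouseEvents');"
            + "ev.initEvent('click',true,true);"
            + "el.dispatchEvent(ev);"
            + "})()";
    }
}
```

Usage would be something like webView.loadUrl(ClickScript.forClassName("[some element]")), called once the page has finished loading and with JavaScript enabled on the WebView.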
I'm developing with Liferay portal.
And now I'm facing a little problem:
I'm building a site for a company that has subsidiaries.
I need to cut out some parts (specifically the header and footer)
of another site (a subsidiary's site) and put the body of the page, without them, in an iframe on the main site.
I was googling for something about grabbers,
but I only found how to grab content with PHP or Perl.
That doesn't seem to be exactly what I need.
You can try the WebProxy portlet for this.
As you'll have to modify the external content's body, you can't simply show it in an iframe, so this portlet might be what you need. It doesn't work with an iframe internally and you can replace some content on-the-fly.
I use Jsoup to scrape the website:
doc = Jsoup.connect(String.valueOf(urls[0])).userAgent("Mozilla").get();
Here is the link:
http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40
I have added the rpp=40 parameter to the link to display 40 results per page. I'm able to see all the results in the page's view source.
I know that Jsoup is for static content only and cannot fetch websites that use AJAX/JS libraries to generate content. However, why can't Jsoup retrieve the same content that I can see in the browser via view source? View source shows 40 results, whereas Jsoup is able to retrieve elements from only 10 results. How can I obtain every element visible via view source?
Short answer: Jsoup can't execute JavaScript.
Long answer:
http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40
The page you are requesting accepts an HTTP GET with parameters. A normal browser sends those parameters and loads the page, but not with Willowbrook checked (as in your example). The JavaScript runs after the page has loaded and ticks the check box that filters the search results. Therefore, when you use Jsoup you get more results, because it loads 'state=IL' without the 'willowbrook' filter applied.
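One detail in that URL is easy to miss: everything after the # is a fragment, and an HTTP client keeps the fragment to itself instead of sending it to the server. So sortby=rating and rpp=40 never reach the server in the request and can only be applied by the page's JavaScript. A small sketch with java.net.URI shows the split (FragmentDemo and its method names are just for illustration):

```java
import java.net.URI;

// Illustrative only: splits the question's URL into the part that is sent
// in the HTTP request and the fragment that stays on the client.
final class FragmentDemo {
    static final String YELP_URL =
        "http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1"
        + "#l=p:IL:Willowbrook::&sortby=rating&rpp=40";

    // Path and query: this is what the server gets to see.
    static String sentToServer(String url) {
        URI uri = URI.create(url);
        return uri.getRawPath() + "?" + uri.getRawQuery();
    }

    // Fragment: only available to scripts running in the browser.
    static String keptOnClient(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        System.out.println("sent:   " + sentToServer(YELP_URL));
        System.out.println("client: " + keptOnClient(YELP_URL));
    }
}
```

If the site also accepts those settings as ordinary query parameters, moving them in front of the # might be worth trying, but that is a guess about the site and not something the URL alone can tell you.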
I have developed a theme in Liferay 6.1. I have a page at "localhost:8080/home", but now I want clicking the link to that page to redirect to localhost:8080.
Any suggestions are welcome.
Thanks in advance.
I think you are confused a little bit, so just some things you should know:
You can't (normally and without hacks) have a page named "localhost:8080". Every page (or 'Layout' in Liferay) has a short name that forms its part of the URL. This is often called the "friendly url", but it's often confused with the "friendly url feature", which is a way to shorten your URL request data.
So you're always going to have URLs like 'localhost:8080/something'. The same holds for the 'home' page.
You can partially shorten the URL by using a 'virtual host'. It removes the part of the URL before your page's name (such as the web/guest or user/username prefix).
You can use the 'friendly url' feature to shorten the part of the URL that comes after the page's name and contains request information like lifecycle state info or custom request parameters.