Parsing html with page hash navigation - java

I have to parse an HTML page. I get it via a URL such as "https://..bla-bla../final_page/#hash-command". But Jsoup fetches the page without the hash command, just final_page, which is not the page I want to parse. How can I make a request that gets the right response? (I also tried using OkHttp to establish the connection, but that did not produce any result either.)
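A minimal sketch of why the hash part never reaches the server: per the URL rules, everything from the first '#' onward is a fragment that the client keeps to itself, so any HTTP client (Jsoup, OkHttp, a browser) requests only the part before it. The host example.com and the class and method names below are placeholders for illustration, not the real URL from the question.

```java
import java.net.URI;

class FragmentDemo {
    /** Returns the part of the URL that is actually sent in the HTTP request. */
    static String requestTarget(String url) {
        int hash = url.indexOf('#');
        // The fragment starts at the first '#'; it is handled client-side only.
        return hash < 0 ? url : url.substring(0, hash);
    }

    /** Returns the fragment (the text after '#'), or null if there is none. */
    static String fragment(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        String url = "https://example.com/final_page/#hash-command"; // hypothetical URL
        System.out.println("requested: " + requestTarget(url));
        System.out.println("kept by the client: " + fragment(url));
    }
}
```

If the page content depends on the fragment, it is produced by JavaScript in the browser after the initial page has loaded, which is why a plain HTTP fetch only ever sees final_page.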

Related

Reading HTML using jsoup

I am trying to get an HTML element from a website using Jsoup, but the HTML that I get from Jsoup.connect(url) is incomplete compared to what I see in the browser inspector on the website.
EDIT: this is the link I'm working with: https://www.facebook.com/livemap##35.831640894,24.82275312499999,2z
The numbers at the end designate the coordinates of the map, and you don't have to sign in to access the page, so there is no authentication problem.
UPDATE:
I have found that the element I want is not expanded when using Jsoup. Is this a problem related to slow page loading? If so, how can I make sure that Jsoup.connect(url) fully loads the web page before fetching the HTML?
from inspector (the <div id="u_0_e"> is expanded)
from jsoup.connect (the <div id="u_0_e"> is not expanded)
Jsoup doesn't execute JavaScript or jQuery events, so you get the initial page as it is before any JavaScript has run.

HtmlUnit AJAX response is partial HTML and fails to parse

The response of an AJAX response in HtmlUnit is a single div, which contains a table of data.
The response also contains a small JS script.
The problem is that HtmlUnit tries to parse the response as a complete HTML page, so it expects that snippet to include all the JS libraries, such as jQuery.
Is there a way to parse the snippet in the context of the parent page that fired the AJAX request?
Alternatively, it would be OK if I just got the response as plain text. But the request has to be made within the session, with all the HTTP headers intact.
I guess the AJAX URL can't be used for getPage. Instead, I loaded the outer page once and triggered a script on that page, which loaded the AJAX response into a div.
String jsCommand = "$('#results_box').load( '"+ pageLink +"',"+ formdata +");";
parentPage.executeJavaScript(jsCommand);

Java HTML Parsing a Page with Infinite Scroll

How can I grab a page's HTML in java if the page has infinite scroll? I'm currently grabbing a page this way:
URL url = new URL(stringUrl);
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();
String encoding = con.getContentEncoding();
encoding = encoding == null ? "UTF-8" : encoding;
String html = IOUtils.toString(in, encoding);
Document document = Jsoup.parse(html);
But it doesn't return any of the content associated with the infinite scroll section of the page. How can I trigger this scrolling on the HTML page so that my Jsoup document contains this section?
Infinite scroll describes a technique where the page itself does not contain the content. Some JavaScript code runs in the browser, sends a request to the server for additional content, and adds it to the page. When you scroll towards the end of the available content, the JavaScript code repeats the process: it sends another request and adds the additional content.
Therefore, you need a web browser with a JavaScript engine that can run the JavaScript code and produce the events that cause the code to load content.
@dsh is right, the content is most likely loaded dynamically via AJAX. As an alternative to using a real browser, e.g. Selenium WebDriver, you can look at the network traffic and identify the API call that the page triggers. You may be able to call that API directly with Jsoup. Often, however, the content is not HTML but JSON, XML or some other format. It may still be well worth doing, since using WebDriver is often pretty slow and resource-heavy.
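Calling the underlying API directly might look roughly like the sketch below. It swaps in the JDK's own java.net.http.HttpClient (Java 11+) rather than Jsoup, because the payload is often JSON; with Jsoup you would typically need Connection.ignoreContentType(true) before fetching a non-HTML body. The endpoint URL and the "page" parameter are purely hypothetical: the real ones have to be read from the browser's network tab for the site in question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ScrollApiClient {
    /**
     * Builds the request for one "page" of content.
     * Assumption: the site pages its content with a ?page=N query parameter.
     */
    static HttpRequest buildPageRequest(String apiUrl, int page) {
        return HttpRequest.newBuilder(URI.create(apiUrl + "?page=" + page))
                .header("Accept", "application/json")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw body (JSON, XML or an HTML fragment). */
    static String fetchPage(HttpClient client, String apiUrl, int page)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildPageRequest(apiUrl, page), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

Repeating fetchPage with an increasing page number until the response is empty mimics what the scroll handler does in the browser. If the body turns out to be an HTML fragment, it can be handed to Jsoup.parse as before.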

Fetching the website with Jsoup - page view source and Jsoup shows different content

I use Jsoup to scrape the website:
doc = Jsoup.connect(String.valueOf(urls[0])).userAgent("Mozilla").get();
Here is the link:
http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40
I added the rpp=40 parameter to the link to display 40 results per page. I'm able to see all the results in the page's view source.
I know that Jsoup works on static content only and cannot fetch content that a website generates with AJAX or JS libraries. However, why can't Jsoup retrieve the same content that I can see in the browser via view source? View source shows 40 results, whereas Jsoup retrieves elements from only 10 results. How can I obtain every element visible via view source?
Short answer: Jsoup can't execute JavaScript.
Long answer:
http://www.yelp.com/search?find_desc=restaurant&find_loc=willowbrook%2C+IL&ns=1#l=p:IL:Willowbrook::&sortby=rating&rpp=40
The web page you are looking at accepts an HTTP GET with parameters. In a normal browser it accepts the parameters and loads the page, but not with 'willowbrook' checked (in your example). The JavaScript runs after the page has loaded, and it is the JavaScript that ticks the check box and filters the search results. Therefore, when you use Jsoup, you get a different, broader result set, because the page is loaded for 'state=IL' without the 'willowbrook' filter applied.
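To see which parameters of that link the server can actually act on, the URL can be split into its query string and its fragment. The sketch below uses only java.net.URI; the class and method names are made up for illustration, and treating the fragment as '&'-separated key=value pairs is an assumption about how this particular page encodes its state.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class UrlParts {
    /** Splits "a=1&b=2" into an ordered map; values are left percent-encoded. */
    static Map<String, String> parseParams(String raw) {
        Map<String, String> params = new LinkedHashMap<>();
        if (raw == null || raw.isEmpty()) {
            return params;
        }
        for (String pair : raw.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    /** Parameters before '#': these are sent to the server. */
    static Map<String, String> serverSideParams(String url) {
        return parseParams(URI.create(url).getRawQuery());
    }

    /** Parameters after '#': these are only visible to JavaScript in the browser. */
    static Map<String, String> fragmentParams(String url) {
        return parseParams(URI.create(url).getRawFragment());
    }

    public static void main(String[] args) {
        String url = "http://www.yelp.com/search?find_desc=restaurant"
                + "&find_loc=willowbrook%2C+IL&ns=1"
                + "#l=p:IL:Willowbrook::&sortby=rating&rpp=40";
        System.out.println("server sees: " + serverSideParams(url).keySet());
        System.out.println("browser only: " + fragmentParams(url).keySet());
    }
}
```

For the link in the question, rpp and sortby end up in the fragment map, so a Jsoup request never transmits them.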

HTML page as AJAX response?

Can I send an entire HTML page as an AJAX response? If so, how do I render that HTML page, and what are the pros and cons of doing that? The reason I am asking is that I found out that if we use response.sendRedirect("index.html") as a reply to an AJAX request in a servlet, we get index.html as the AJAX response XML.
Your question doesn't make much sense and I can't quite tell what you're asking. The response to an AJAX request will be whatever the server sends back. It could be plain text, XML, HTML, a fragment of an HTML/XML document, etc. What you can do with it depends on your script. If you're using a library like jQuery, what happens on the client side and what you can do with the response can also depend on how the library interprets the response (Is it a script? Is it HTML/XML or JSON?).
if we use response.sendRedirect("index.html") as a reply to an AJAX request in a servlet we get index.html as the AJAX response XML. Can someone please explain this?
An AJAX request behaves much like a 'regular' HTTP request. So when you send back a redirect from your server (HttpServletResponse#sendRedirect), the browser follows the redirect just like it would for any other request. If your AJAX request was to a resource that required HTTP Basic authentication, you'd see a login prompt, just like you would if you visited the URL directly in a new browser window.
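The redirect behaviour can be reproduced outside a browser. The sketch below is only a stand-in for the real setup: the JDK's built-in com.sun.net.httpserver.HttpServer plays the servlet, and java.net.http.HttpClient (Java 11+), configured to follow redirects, plays the browser's AJAX request. The paths /ajax and /index.html and the page body are made up for the demonstration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class RedirectDemo {
    /** Requests /ajax, which redirects to /index.html, and returns the final body. */
    static String fetchFollowingRedirect() {
        try {
            // Port 0 lets the OS pick any free local port.
            HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
            // Stand-in for a servlet calling response.sendRedirect("index.html").
            server.createContext("/ajax", exchange -> {
                exchange.getResponseHeaders().add("Location", "/index.html");
                exchange.sendResponseHeaders(302, -1); // -1: no response body
                exchange.close();
            });
            server.createContext("/index.html", exchange -> {
                byte[] body = "<html><body>index</body></html>".getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "text/html");
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            try {
                HttpClient client = HttpClient.newBuilder()
                        .followRedirects(HttpClient.Redirect.NORMAL) // behave like a browser
                        .build();
                URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/ajax");
                HttpResponse<String> response =
                        client.send(HttpRequest.newBuilder(uri).build(),
                                HttpResponse.BodyHandlers.ofString());
                return response.body();
            } finally {
                server.stop(0);
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchFollowingRedirect());
    }
}
```

The caller asked for /ajax but receives the markup of /index.html, which matches what the script sees when the servlet answers an AJAX call with a redirect.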
If you want to send HTML as a response because you want to update divs, tables or other elements, but still want to use the same CSS or JavaScript files, then it can make sense.
What you can do is send it back as text/plain to the JavaScript function, which can then put it into the innerHTML of the element that you want to replace. But don't do this if you want to replace the entire page; in that case doing what you want is pointless.
When you make the HTTP request for your AJAX call, it has its own response stream, so when you redirect you are just telling the browser to have that HTTP request go to index.html, as @no.good.at.coding mentioned.
If I understand your question, you just want to know whether you could return an entire HTML page with AJAX, and what the pros and cons are.
The short answer to your question is yes, you could return the entire HTML page with your AJAX response, as AJAX is just an HTTP request to the server.
Why would you want to get the entire HTML? That is the part that is not clear to me. If you want to render the entire HTML (including tags like html, body, etc.), you might as well open it as a new page instead of calling it via AJAX.
If you are saying that you only want to get fragments of HTML to populate a placeholder in your page via AJAX, then this is an acceptable practice. jQuery even provides the load() function (http://api.jquery.com/load/) to make that task easy (check the section Loading Page Fragments).
Using this method, you could populate the placeholder with the HTML fragments dictated by your server logic (e.g. when the login fails or succeeds), including the one delivered via the server redirect in your question.
