Extracting data from a web page with jQuery content - Java

I need to extract data from particular websites, say the comment section of a website. What I have already tried is extracting the HTML text using Jsoup, but since the comment section is generated by jQuery, it only extracts the jQuery code, not the comment text. Any suggestions to solve my problem? Thank you.

You can use HtmlUnit to render the page with all the needed content and then extract the data from the built DOM tree. Here you can find info on what to do if AJAX doesn't work OOTB.
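A minimal sketch of that approach with HtmlUnit 2.x (the page URL is passed as an argument, and the `.comment` selector and class name are placeholders of mine; adjust them to the target site's markup):

```java
import java.util.ArrayList;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class CommentScraper {

    // Collect the text of every node matching a CSS selector
    // from an already-rendered page.
    static List<String> extractTexts(HtmlPage page, String cssSelector) {
        List<String> texts = new ArrayList<>();
        for (DomNode node : page.querySelectorAll(cssSelector)) {
            texts.add(node.asNormalizedText());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) return; // pass the page URL as the first argument
        try (WebClient client = new WebClient()) {
            // real-world pages often have scripts that error out; don't abort on them
            client.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = client.getPage(args[0]);
            // give the page's jQuery/AJAX calls up to 5 seconds to finish
            // inserting the comments before we read the DOM
            client.waitForBackgroundJavaScript(5_000);
            extractTexts(page, ".comment").forEach(System.out::println);
        }
    }
}
```

The `waitForBackgroundJavaScript` call is the key difference from Jsoup: it lets the jQuery code run and populate the DOM, so the extracted text is the comments rather than the script that generates them.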

Related

Saving page content using selenium

I am using Selenium to gather data from a web portal. The problem is that the data is in XML format, but the URL extension is not .xml; it is displayed as .aspx since it is a .NET website. Using Selenium I can get the page source with driver.getPageSource(),
but that gives it to me wrapped in HTML. Separating the XML from the HTML is a real pain, and I have tried many options such as Jsoup, but it seems like there is too much parsing to be done.
Is there any other way to make Selenium manipulate the browser? I can see that File -> Save As gives me an option to save the web page in XML format. How do I do this in Selenium? Are there any other APIs that can help me out here?
Edit: My browser here is Internet Explorer.
Have you tried this?
String pageSource = driver.findElement(By.tagName("body")).getText();
Check the pageSource content; if it gives you only the XML, you can write it to a file using file operations.
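The save-to-file step can be sketched with plain JDK file I/O, and parsing the saved file back with the JDK's `DocumentBuilder` is a quick sanity check that you captured well-formed XML rather than the browser's HTML wrapper (the `pageSource` string and the file name here are made-up stand-ins for what Selenium would return):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlSaver {

    // Write the text Selenium extracted from the <body> tag to a file.
    static Path saveXml(String xml, Path target) throws Exception {
        return Files.write(target, xml.getBytes(StandardCharsets.UTF_8));
    }

    // Parse the saved file; this throws if the content is not well-formed XML,
    // e.g. if the browser's HTML viewer markup leaked into the capture.
    static String rootElementName(Path xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile.toFile());
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        // stand-in for driver.findElement(By.tagName("body")).getText()
        String pageSource = "<report><row id=\"1\">data</row></report>";
        Path out = saveXml(pageSource, Path.of("page.xml"));
        System.out.println("root element: " + rootElementName(out));
    }
}
```

If `DocumentBuilder` rejects the saved text, the body text still contains browser chrome (IE's XML tree viewer adds its own markup), and you would need to strip that first.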

How to retrieve dynamically generated web elements?

I fetch the website using Jsoup. Here is the link to the web:
http://www.yelp.com/search?find_desc=restaurants&find_loc=westmont%2C+il&ns=1&ls=43131f934bb3adf3#find_loc=Hinsdale,+IL&l=p:IL:Hinsdale::&sortby=rating&unfold=1
Now I'm trying to extract the number of sub-pages on the web. For example the numbers next to "Go to Page" as shown in the picture below:
Unfortunately, neither 'view source' in the browser nor Jsoup is able to see these elements. I guess this content is embedded dynamically into the page. If so, what is the best way to access dynamically generated content? Thanks.
For websites that use AJAX/JS library techniques to generate content, you may want to use HtmlUnit instead (HtmlUnit can simulate JavaScript events). Jsoup is only for static HTML, i.e. the markup you can see via view-source.
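The difference is easy to demonstrate offline: Jsoup happily parses a document that contains a script, but it never executes the script, so script-generated text simply isn't in the parse tree (a tiny self-contained illustration, not Yelp's actual markup):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticVsDynamic {
    public static void main(String[] args) {
        // the pagination container starts empty and is filled in by a script
        String html = "<html><body><div id='pages'></div>"
                + "<script>document.getElementById('pages').textContent"
                + " = 'Go to Page 1 2 3';</script>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);
        // Jsoup keeps the <script> as text but never runs it,
        // so the div is still empty in the parse tree
        System.out.println("Jsoup sees: '" + doc.select("#pages").text() + "'");
        // → Jsoup sees: ''
    }
}
```

A JavaScript-capable client such as HtmlUnit would run the script first and report "Go to Page 1 2 3" for the same selector.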

How to grab data that is not in html source but visible from browser?

The data I want is visible in the browser, but I can't find it in the HTML source code. I suspect the data is generated by scripts, and I'd like to grab that kind of data. Is this possible using Jsoup? I'm aware that Jsoup does not execute JavaScript.
Take this page for example: I'd like to grab all the colleges and schools under Academics -> COLLEGES & SCHOOLS.
If the DOM content is generated by scripts or plugins, then you should really consider a scriptable browser like PhantomJS. Then you can just write some JavaScript to extract the data.
I didn't check your link, and I assume you're looking for a general answer not specific to any page.

use jsoup get comments from html and save as XML file

I need some help!
Go to this page: http://www.tweetfeel.com/
Then type in "linkin park". How can I use Jsoup to get all the user comments and save them as XML?
I'm using Java with NetBeans.
I'm not sure this is possible with Jsoup, since the content is generated dynamically by JavaScript.

Parse html pages and store the contents(title,text and etc) into Database

Does anybody know some open source tools to parse HTML pages and filter out the ads, JavaScript, etc. to get the title and text? The front end of my application is based on LAMP, so I need to parse the HTML pages, store them in MySQL, and populate the front pages with that data.
I know some tools, Heritrix and Nutch for example, but it seems that they are crawlers.
Thanks.
Joseph
It depends on what you mean by "text" from the webpage. I did a similar thing by grabbing a webpage using the Apache HttpClient libraries and then using dom4j to look for a particular tag to extract text from. In effect, though, you need the same type of crawler that search engines like Google use: you are emulating the basic steps they perform when they crawl a website and extract its information. It would be helpful if you went into a little more detail on what kind of information you want to retrieve from the pages.
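The fetch-then-extract step described above can be sketched with the JDK alone: `java.net.http.HttpClient` stands in for Apache HttpClient, and a naive regex stands in for dom4j. That is enough for the `<title>` field; body text and ad filtering would still call for a real HTML parser. The class name and the argument-passed URL are my own placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleGrabber {

    // crude <title> matcher; tolerates attributes and line breaks
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) return; // pass a URL to fetch
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(args[0])).build();
        String html = client.send(request,
                HttpResponse.BodyHandlers.ofString()).body();
        // the extracted title is what you would INSERT into the MySQL table
        System.out.println(extractTitle(html));
    }
}
```

From here, storing the title and text is an ordinary JDBC INSERT into the MySQL tables the LAMP front end reads.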
