Scraping Yahoo Answers with Jsoup - java

I am trying to scrape the results of a keyword search on Yahoo Answers, in my case "alcohol addiction". I am using Jsoup and URL modification to go through the pages of search results. However, I am noticing that even though I put in the URL for 'Newest' results, it keeps showing 'Relevance' results, and what's worse, the results are not exactly the same as what's shown in the browser.
For instance, the URL for Newest results is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=new
And for relevant results, the URL is:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=rel
And the "1" will change to 2, 3, 4, etc as you go to the next page (there are 10 results per page).
Here's what I do to scrape the page:
String urlID = "";
String end = "&sort=new";
String glob = "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=";
Integer forumID = 0;
while (nextPageIsThere) {
    forumID++;
    System.out.println("Now extracting the page: " + forumID);
    try {
        urlID = glob + forumID + end;
        System.out.println(urlID);
        exdoc = Jsoup.connect(urlID).get();
        java.util.Date date = new java.util.Date();
    } catch (IOException e) {
        e.printStackTrace();
    }
...
What's even more confusing is even if I increase the page number, and the system output shows that the URL is changing to:
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=2&sort=new
and
http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=3&sort=new
it still scrapes the same page as shown on page 1 over and over again. I know my code is not wrong; I've been debugging it for hours. I think it has something to do with Jsoup.connect, and/or Yahoo Answers possibly blocking bots? At the same time, I don't think it's really that.
Does anyone know why this might be happening?

Jsoup works with static HTML only; it can't parse dynamic pages like this, where content is downloaded after the page loads via an Ajax request or JavaScript modification.
Try reading this page with HtmlUnit; that parser has support for JS-driven pages.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
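A minimal HtmlUnit sketch for this page (assuming a reasonably current HtmlUnit version; the 10-second JavaScript wait is an arbitrary choice):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class YahooAnswersFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX);
        // Large sites often trigger harmless script errors; don't abort on them.
        webClient.getOptions().setThrowExceptionOnScriptError(false);

        HtmlPage page = webClient.getPage(
                "http://answers.yahoo.com/search/search_result?p=alcohol+addiction&s=1&sort=new");
        // Give background Ajax requests up to 10 seconds to finish.
        webClient.waitForBackgroundJavaScript(10000);

        // asXml() serializes the DOM *after* script execution; you can hand
        // this string to Jsoup if you prefer its selectors for extraction.
        System.out.println(page.asXml());
        webClient.close();
    }
}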

How to read dynamic website content in java

As per the HTML, the source code is:
{result.data}
When requesting the URL, result.data is set to 100 and I am able to see the value as 100 in the browser. But when I execute a Java program with the same URL request, I am unable to see the value that I saw in the browser.
URL url = new URL(site);
URLConnection connection = url.openConnection();
// ... read the response from connection.getInputStream()
I want to get the same content through a Java program as I see in the browser.
Your question is not very descriptive, but I guess you are trying to scrape data from the site.
You can use the following libraries for this task:
Jaunt (http://jaunt-api.com)
Jsoup (http://jsoup.org/cookbook/extracting-data/dom-navigation)
HTMLUnit
From what I understand, you want to do one of the two things below:
Instead of reading the result line by line, you want to parse it as XML so as to traverse the div(s) and other HTML tags.
For this purpose I would suggest you use the jsoup library (see the sketch after this list).
When you hit the URL www.abcd.com/number=500 in a browser, it loads an empty div and on load it fetches data from somewhere; you want to fetch this data using Java?
For this, there must be some JS in the resulting page which fetches data by hitting some service on page load. You will need to look in the page source for the service details and, instead of hitting this URL (www.abcd.com/number=500), hit that service to get the data.
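A minimal jsoup sketch of the first approach (the URL and the div id are hypothetical placeholders; adjust them to the real page):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivScrape {
    public static void main(String[] args) throws Exception {
        // Fetch the page and build a DOM that can be traversed with selectors.
        Document doc = Jsoup.connect("http://www.abcd.com/number=500").get();

        // Select a div by a hypothetical id and read its text content.
        Element result = doc.select("div#result").first();
        if (result != null) {
            System.out.println(result.text());
        }
    }
}

Note that this only sees the server-rendered HTML; data filled in later by JavaScript will not be there, which is why the second approach targets the underlying service directly.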

jsoup - Cleaning HTML with missing and broken tags

I am looking for a way to clean HTML text that may have some missing or broken tags in it. These snippets are usually written by non-programmers, and there can be a number of problems with the HTML. Here is what I've tried:
Parser p = Parser.htmlParser();
String test = "Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>";
Document d = p.parseInput(test, StringUtils.EMPTY);
System.out.println("BEFORE: " + test);
System.out.println("JSPARSED: " + StringUtils.remove(d.body().html(), "\n"));
System.out.println("JSOUP: "+ Jsoup.clean(test, StringUtils.EMPTY, Whitelist.relaxed()));
Output is:
BEFORE: Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>
JSPARSED: Here is a <i>fake message.<br><b><i>- Publisher</i></b></i>
JSOUP: Here is a
<i>fake message.<br><b><i>- Publisher</i></b></i>
The desired output is:
Here is a <i>fake</i> message.<br><b><i>- Publisher</i></b>
Is it possible to clean the HTML for the above situations using jsoup?
EDIT: To add a bit more context, this HTML block is displayed on our website as the description for a product. It is usually written by the marketing team or the publisher and at times has some mistakes in the HTML. We currently use JTidy for HTML cleanup before displaying it on the website.
I recently ran a program to see how many products have an error in the description and found roughly 30,000 products with errors. After reviewing some of them, I saw that the majority of the errors are tags in the wrong order (which the program fixes), but errors where tags are missing or broken, as shown in the example, were not fixed as intended.
It is not likely you will ever get consistent results autocorrecting 30k malformed HTML snippets. Chances are, you will end up with even more screwed-up content.
Do yourself a favor:
Programmatically forbid saving broken HTML for new/edited descriptions (a validation sketch follows below).
Hire someone to correct the existing ones manually (or delegate it to the marketing team that put the errors there in the first place).
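A minimal sketch of such a save-time check using jsoup's parse-error tracking; it catches many, though not all, malformed-markup problems, and the error limit of 10 is an arbitrary choice:

import org.jsoup.Jsoup;
import org.jsoup.parser.ParseErrorList;
import org.jsoup.parser.Parser;

public class DescriptionValidator {
    // Returns true if jsoup's parser reports no errors in the snippet.
    public static boolean isWellFormed(String html) {
        // Track up to 10 parse errors instead of silently autocorrecting.
        Parser parser = Parser.htmlParser().setTrackErrors(10);
        Jsoup.parse(html, "", parser);
        ParseErrorList errors = parser.getErrors();
        return errors.isEmpty();
    }

    public static void main(String[] args) {
        String broken = "Here is a <i>fake</> message.<br><b><i>- Publisher</b></i>";
        // Reject the save (or flag it for manual review) instead of fixing it.
        // Likely false here: '</>' and the mis-nested tags trip the parser.
        System.out.println("well-formed? " + isWellFormed(broken));
    }
}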

HTML Unit - read from a normal string?

I want to use HtmlUnit for Java.
In all the examples, the HTML code is read from a specific website.
But I want to read the HTML source from a String instead.
Like this:
String myString = "<html> myString and Content </html>";
HtmlPage page = myString; // doesn't work, how can I do something like this?
I see only examples like this:
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
Can I also read only a Table?
Like this:
String myTable = "<table><td></td></table>";
HtmlTable table = myTable; // doesn't work, how can I do something like this?
My question is now: how can I convert this correctly?
Can anybody help me, please?
HtmlUnit isn't really designed for this use case, so it will always be a bit of a hassle to make it work. If you're not tied to HtmlUnit specifically, you might be better off using something like jsoup, which has better built-in support for parsing HTML from strings.
That said, if you are tied to HtmlUnit, it's possible to make this work. For inspiration, you could look at how HtmlUnit sets up HtmlPage objects in its own test suite.
As you can see there, although there's no way to construct an HtmlPage directly from a String, you can make a MockWebConnection that'll give a canned response without involving the network. So your code could look something like this:
String html = "<html>Your html here</html>";
WebClient client = new WebClient();
MockWebConnection connection = new MockWebConnection();
connection.setDefaultResponse(html);
client.setWebConnection(connection);
// The URL doesn't matter: the mock connection answers every request
// with the canned response above.
HtmlPage page = client.getPage("http://example.com");
(Apologies for any errors in the above -- I'm no longer on a Java project, so I don't have a convenient way to test this right now. That said, I did spend some time on a large Java project that used roughly this technique for a lot of tests. It worked reasonably well, but it tended to be a bit fragile when we upgraded HtmlUnit. Overall, we were happier when we moved to Jsoup.)
Here is another way of doing it, similar to Collum's but a little different.
WebClient webClient = new WebClient();
URL url = new URL("http://example.com");
WebRequest requestSettings = new WebRequest(url, HttpMethod.GET);
StringWebResponse response = new StringWebResponse("<html> myString and Content </html>", url);
HtmlPage page = HTMLParser.parseHtml(response, webClient.getCurrentWindow());
As for getting the table, it is possible. You can load it with the method above and extract it with the code below.
HtmlTable table = page.getHtmlElementById("table1");
You can iterate over the rows and cells with the code below:
for (final HtmlTableRow row : table.getRows()) {
    System.out.println("Found row");
    for (final HtmlTableCell cell : row.getCells()) {
        System.out.println("  Found cell: " + cell.asText());
    }
}
And you can access specific cells with the example below:
System.out.println("Cell (1,2)=" + table.getCellAt(1,2));
Please comment if you get stuck and I may be able to help.

What URL do I use to open a String object in a web browser

If I have an HTML String object, how can I get the browser to open that String as an HTML page using Selenium in Java? I have seen this done before, but I don't remember the format the URL needs to take.
For this example, let's say the string is :
<h2>This is a <i>test</i></h2>
I looked through this page and couldn't find the answer but I might be overlooking it. For example I tried this URL and it didn't work for me:
data:<h2>This is a <i>test</i></h2>
Here is a link to the documentation: http://en.wikipedia.org/wiki/Data_URI_scheme. You need to specify the MIME type of the data. Try: data:text/html,<h2>This is a <i>test</i></h2>
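In Selenium's Java bindings that could look like the sketch below (any WebDriver implementation works; FirefoxDriver is just an example):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class DataUriDemo {
    public static void main(String[] args) {
        String html = "<h2>This is a <i>test</i></h2>";
        WebDriver driver = new FirefoxDriver();
        // Browsers accept raw markup after the comma; characters such as
        // '#' or '%' would need percent-encoding first.
        driver.get("data:text/html," + html);
    }
}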

Scraping Data. Save File?

I am trying to scrape data from a website which uses JavaScript to load much of its content. Right now I am using jSoup to parse HTML pages; however, since much of the content is loaded using JavaScript, I haven't been able to parse the data I want.
How should I go about getting this JavaScript content? Should I first save the page, then load and parse it using jSoup? If so, what should I use to load the JavaScript content before I save? Is there an API you would recommend that could output HTML?
I am currently using Java.
You might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context - among other things, you can define a "ready" function for the page and wait to scrape until the function (which might check for the existence of certain DOM elements, etc) returns true.
The other option, depending on the page, is to use a console like Firebug to figure out what data is being loaded (i.e. what URLs are being retrieved by the AJAX calls on the page), and scrape the data directly from those URLs.
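If you go the direct-URL route, jsoup can fetch those endpoints too; a small sketch (the endpoint here is a hypothetical stand-in for whatever Firebug's network tab shows):

import org.jsoup.Jsoup;

public class AjaxEndpointFetch {
    public static void main(String[] args) throws Exception {
        // Fetch the raw response (often JSON) instead of the HTML shell.
        // ignoreContentType lets jsoup return non-HTML bodies.
        String body = Jsoup.connect("http://example.com/api/data?page=1")
                .ignoreContentType(true)
                .execute()
                .body();
        System.out.println(body);
    }
}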
If the data are generated with JavaScript, then the data are already in the downloaded page (for example, embedded in an inline script block).
It is better to parse them directly on the fly, as you do with plain HTML or text parsing.
If you cannot isolate the tokens with the jSoup API, just parse them with direct String operations, as plain text.
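A rough sketch of that fallback, assuming the value sits in an inline script as something like var count = 100; (the URL, variable name, and pattern are hypothetical):

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class InlineScriptScrape {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/page").get();
        // Hypothetical pattern; adjust it to the page's actual script.
        Pattern pattern = Pattern.compile("var count = (\\d+);");
        // element.data() returns the raw text of script/style elements.
        for (Element script : doc.select("script")) {
            Matcher m = pattern.matcher(script.data());
            if (m.find()) {
                System.out.println("count = " + m.group(1));
            }
        }
    }
}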
I tried using HtmlUnit; however, I found it very slow.
I ended up using the curl command-line tool from within Java, which worked for my purposes.
// Shell out to curl and capture its stdout as the page source.
String command = "curl " + url;
Process p = Runtime.getRuntime().exec(command);
BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
String s;
StringBuilder html = new StringBuilder();
while ((s = stdInput.readLine()) != null) {
    html.append(s).append("\n");
}
return html.toString();
