I am new to writing code and I am trying to write code to scrape a specific website. The issue is that this website shows a page asking you to accept its conditions of use and privacy policy before you can continue, as you can see at http://cpdocket.cp.cuyahogacounty.us/
I need to get past this page somehow and I have no idea how. I am writing my code in Java, and so far I have working code that fetches the source of any website. This code is:
import java.net.URL;
import java.net.URLConnection;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

// Scraper takes a URL string as input and returns the source code of that website.
public class Scraper {

    private static String url; // the input website to be scraped

    // constructor
    public Scraper(String url) {
        this.url = url;
    }

    // scrapeWebsite downloads the page at the input URL. As of now it returns a String;
    // ideally that String is saved so it can be parsed by another method.
    public static String scrapeWebsite() throws IOException {
        URL urlconnect = new URL(url);                          // creates the URL from the variable
        URLConnection connection = urlconnect.openConnection(); // connects to the created URL
        BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), "UTF-8"));         // reader over the response stream
        String inputLine;
        StringBuilder a = new StringBuilder();
        // loop appends to the StringBuilder as long as there is input
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine);
        in.close();
        return a.toString();
    }
}
Any suggestions on how to go about doing this would be greatly appreciated.
I am rewriting the code based on a Ruby script. The Ruby code is:
def initializeSession()
## SETUP # POST headers
post_header = Hash.new()
post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us'
post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0'
post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
post_header['Accept-Language'] = 'en-US,en;q=0.5'
post_header['Accept-Encoding'] = 'gzip, deflate'
post_header['X-Requested-With'] = 'XMLHttpRequest'
post_header['X-MicrosoftAjax'] = 'Delta=true'
post_header['Cache-Control'] = 'no-cache'
post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8'
post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request
# post_header['Content-Length'] = '12197'
post_header['Connection'] = 'keep-alive'
post_header['Pragma'] = 'no-cache'
# STEP # set up simulated browser and make first request
#browser = SimBrowser.new()
#logname = 'log.txt'
#s = Scribe.new(logname)
session_cookie = 'ASP.NET_SessionId'
url = 'http://cpdocket.cp.cuyahogacounty.us/'
#browser.http_get(url)
#puts browser.get_body() # debug
puts 'DEBUG: session cookie: ' + #browser.get_cookie_var(session_cookie)
#log.slog('DEBUG: home page response code: expected 200, actual ' + #browser.get_response().code)
# s.flog('### HOME PAGE RESPONSE')
# s.flog(browser.get_body()) # debug
# STEP # send our acceptance of the terms of service
data = {
'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes',
'__EVENTARGUMENT'=>'',
'__EVENTTARGET'=>'',
'__EVENTVALIDATION'=>'/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD',
'__VIEWSTATE'=>'/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B'
}
#post_header['Referer'] = url
#browser.http_post(url, data, post_header)
#log.slog('DEBUG: accept terms response code: expected 200, actual ' + #browser.get_response().code)
#log.flog('### TOS ACCPTANCE RESPONSE')
# #log.flog(#browser.get_body()) # debug
end
Can this be done in Java as well?
If you don't understand how to do this, the best way to learn is to do it manually while watching what happens with Firebug (on Firefox) or the equivalent developer tools in IE, Chrome or Safari.
You must duplicate in your code whatever happens in the protocol when a user accepts the terms & conditions manually.
You must also be aware that the UI presented to the user may not be sent directly as HTML; it may be constructed dynamically by JavaScript that would normally run in the browser. If you are not prepared to fully emulate a browser, to the point of maintaining a DOM and executing JavaScript, this may not be possible.
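In this particular case the terms page appears to be a plain ASP.NET form POST, which is exactly what the Ruby snippet in the question replays, so a rough Java sketch of the same two steps (grab the session cookie, then POST the acceptance) would look something like the following. The form field names are taken from that Ruby snippet, parseHiddenField is a hypothetical helper of mine, and the __VIEWSTATE and __EVENTVALIDATION values are page-specific, so they must be read out of the live home page rather than hard-coded.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TosAcceptSketch {

    public static void main(String[] args) throws Exception {
        String base = "http://cpdocket.cp.cuyahogacounty.us/";

        // Step 1: GET the home page and capture the ASP.NET session cookie.
        HttpURLConnection get = (HttpURLConnection) new URL(base).openConnection();
        String cookie = get.getHeaderField("Set-Cookie"); // e.g. "ASP.NET_SessionId=...; path=/"
        String homeHtml = readBody(get);

        // Step 2: pull the hidden ASP.NET form fields out of the home page.
        String viewState = parseHiddenField(homeHtml, "__VIEWSTATE");
        String eventValidation = parseHiddenField(homeHtml, "__EVENTVALIDATION");

        // Step 3: POST the "Yes" button field back, sending the cookie with it.
        String form = "ctl00$SheetContentPlaceHolder$btnYes=" + URLEncoder.encode("Yes", "UTF-8")
                + "&__EVENTTARGET=&__EVENTARGUMENT="
                + "&__VIEWSTATE=" + URLEncoder.encode(viewState, "UTF-8")
                + "&__EVENTVALIDATION=" + URLEncoder.encode(eventValidation, "UTF-8");

        HttpURLConnection post = (HttpURLConnection) new URL(base).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("Cookie", cookie);
        post.setRequestProperty("Content-Type", "application/x-www-form-urlencoded; charset=utf-8");
        post.setRequestProperty("Referer", base);
        try (OutputStream out = post.getOutputStream()) {
            out.write(form.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Accept-TOS response code: " + post.getResponseCode());
    }

    // Reads the whole response body as a string.
    private static String readBody(HttpURLConnection conn) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: extracts the value attribute of a hidden input by its id.
    private static String parseHiddenField(String html, String name) {
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("id=\"" + java.util.regex.Pattern.quote(name) + "\" value=\"([^\"]*)\"")
                .matcher(html);
        return m.find() ? m.group(1) : "";
    }
}

Once the POST succeeds, reuse the same Cookie header on every later request in that session; the terms page should no longer appear, assuming the site keys acceptance to the ASP.NET session.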
Related
I am trying to download all the images from a site, but I'm not sure this is the best way, as I have tried setting a user agent and referrer to no avail. The 403 status error only occurs when downloading the images from their src URLs; the page that lists all the images in one place doesn't show any errors and provides the src of each image. I am not sure whether there is a way to download the images without visiting the src page, or whether there is a better way to do this entirely.
Here is my code so far.
private static void getPages() throws IOException {
    Document doc = Jsoup.connect("https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686")
            .get();
    Elements media = doc.getElementsByTag("img");
    System.out.println(media);
    Iterator<Element> ie = media.iterator();
    int i = 1;
    while (ie.hasNext()) {
        Response resultImageResponse = Jsoup.connect(ie.next().attr("src")).ignoreContentType(true)
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0")
                .referrer("www.google.com").timeout(120000).execute();
        FileOutputStream out = new FileOutputStream(new java.io.File("image #" + i++ + ".jpg"));
        out.write(resultImageResponse.bodyAsBytes());
        out.close();
    }
}
There are a few problems with your suggested approach:
You're trying to use Jsoup to download the file content. Jsoup is only for the text/markup of the page and won't give you usable image data on its own; to download the image content you need to make an HTTP request for each image.
To download the images you also need to copy the request a browser would make. Open Chrome, open the developer tools and go to the Network tab, then enter the URL of the page you want to scrape images from and you'll see a bunch of requests being made. There will be an individual request for each image somewhere in that view; if you click on the one labelled 1.jpg you'll see the request made to download the first image, and you then need to copy all the headers used to make the request for that image. Note that both request AND response headers are shown in this view. Once you've replicated the request successfully, you can start testing which headers/cookies are actually required. I found the only real requirement was the "referer" header.
I've stripped out most of what you might need/want, but something similar to the code below is what you're after. I pulled the comic book images in their entirety at full quality. I introduced a small sleep timer so as not to overload the server, since sometimes you'll get rate limited. Even without it you should be fine, but you don't want to get blocked for a lengthy period of time, so the slower you allow the requests to come back the better. You could even make the requests in parallel.
I'm almost certain you could cut back some of the code below to get a cleaner result, but it works and I'm assuming that's more than enough.
Interesting question.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Iterator;

public class JSoupExample {

    private static final int TIMEOUT = 30000;
    private static final int BUFFER_SIZE = 4096;

    public static void main(String... args) throws InterruptedException, IOException {
        String url = "https://manganelo.com/chapter/read_bleach_manga_online_for_free2/chapter_686";
        Document doc = Jsoup.connect(url).get();

        // Select only urls where the source starts with the relevant url (not all images)
        Elements media = doc.select("img[src^=\"https://s5.mkklcdnv5.com/mangakakalot/r1/read_bleach_manga_online_for_free2/chapter_686_death_and_strawberry/\"]");

        Iterator<Element> ie = media.iterator();
        int i = 1;
        while (ie.hasNext()) {
            String imageUrlString = ie.next().attr("src");
            System.out.println(imageUrlString + " ");
            try {
                HttpURLConnection response = makeImageRequest(url, imageUrlString);
                if (response.getResponseCode() == 200) {
                    writeToFile(i, response);
                }
            } catch (IOException e) {
                // skip file and move to next if unavailable
                e.printStackTrace();
                System.out.println("Unable to download file: " + imageUrlString);
            }
            i++; // increment image ID whatever the result of the request.
            Thread.sleep(200L); // prevent yourself from being blocked due to rate limiting
        }
    }

    private static void writeToFile(int i, HttpURLConnection response) throws IOException {
        // opens input stream from the HTTP connection
        InputStream inputStream = response.getInputStream();
        // opens an output stream to save into file
        FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg");

        int bytesRead;
        byte[] buffer = new byte[BUFFER_SIZE];
        while ((bytesRead = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, bytesRead);
        }

        outputStream.close();
        inputStream.close();
        System.out.println("File downloaded");
    }

    private static HttpURLConnection makeImageRequest(String referer, String imageUrlString) throws IOException {
        URL imageUrl = new URL(imageUrlString);
        HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();
        response.setRequestMethod("GET");
        response.setRequestProperty("referer", referer);
        response.setConnectTimeout(TIMEOUT);
        response.setReadTimeout(TIMEOUT);
        response.connect();
        return response;
    }
}
I'd also want to set the right file extension based on the content type, as I believe some images were coming back as .png rather than .jpeg. I'm also fairly sure the write-to-file logic could be made simpler and clearer than manually copying a byte stream.
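For the extension, one option is a small helper that looks at the response's Content-Type header. A sketch, assuming the server reports the type accurately; the mapping below is mine, not something from the original code:

// Choose a file extension from the HTTP Content-Type header instead of assuming .jpg.
private static String extensionFor(HttpURLConnection response) {
    String contentType = response.getContentType(); // e.g. "image/png"
    if (contentType == null) {
        return ".jpg"; // fall back to the current default
    }
    switch (contentType.toLowerCase()) {
        case "image/png":  return ".png";
        case "image/gif":  return ".gif";
        case "image/webp": return ".webp";
        case "image/jpeg":
        default:           return ".jpg";
    }
}

The output file in writeToFile would then become new FileOutputStream("image_" + i + extensionFor(response)) instead of hard-coding ".jpg".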
I'm trying to create a simple project where the user inputs a URL and I fetch the relevant information (author, title, etc.) for a citation. The problem is that the Java URL library doesn't seem to fetch the entire page source. For example, I'll use the link https://www.cia.gov/library/publications/the-world-factbook/geos/jo.html as a reference. Here's the code I'm using:
import java.net.*;
import java.io.*;
import java.util.ArrayList;

public class URLTester
{
    private static URL url;

    public URLTester(URL u)
    {
        url = u;
    }

    public static ArrayList<String> getContents() throws Exception
    {
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        ArrayList<String> arr = new ArrayList<String>();
        while ((inputLine = in.readLine()) != null)
        {
            arr.add(inputLine);
        }
        in.close();
        return arr;
    }

    public static void main(String args[]) throws Exception
    {
        url = new URL("https://www.cia.gov/library/publications/the-world-factbook/geos/jo.html");
        ArrayList<String> contents = getContents();
        for (int i = 0; i < contents.size(); i++)
        {
            System.out.println(contents.get(i));
        }
    }
}
This fetches what appears to be a shortened version of the page source. When I click 'view page source' on the site, a much longer version comes up, including information such as the date and the author of the article. I can't paste the source here because it would exceed the character limit. How can I get the entire page source instead of a shortened version?
The problem is that you are exceeding the console's character limit, not that the page is being truncated.
The default limit in Eclipse is 80,000 characters.
To change the preference, go to Window -> Preferences.
Then find Run/Debug in the left menu.
Then open and choose Console.
Uncheck "Limit console output", or increase the limit as you want.
First of all, I am sorry if I am wrong in saying the response is fake JSON...
The API I am using is the ticker API of Unocoin:
https://www.unocoin.com/trade?all
I have been working on a website that takes the rates from various Indian bitcoin exchanges and plots graphs for easy visualization. So far I have added three exchanges and got their rates from their ticker APIs; the responses were just plain text with no surprises.
Exchanges like
ZEBPAY: https://www.zebapi.com/api/v1/market/ticker/btc/inr
Koinex: https://koinex.in/api/ticker
made my life easy, but
making a GET request to the Unocoin API gives me an HTML page with only an iframe in the body tag, and I am not able to use the data in my code, directly or indirectly.
There is an alternate method that gives access to many features, but it requires me to register and send my ACCESS TOKEN with every request, which I would prefer not to do right now.
To make the API calls I am using Java; the code is given below:
private static String sendGet(String host, String apiEndpoint) throws Exception {
    URL obj = new URL(host + apiEndpoint);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();

    // optional, default is GET
    con.setRequestMethod("GET");

    // add request header
    con.setRequestProperty("User-Agent", USER_AGENT);

    int responseCode = con.getResponseCode();
    System.out.println(responseCode);

    BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    in.close();

    return response.toString();
}
Just a note: I get a Google reCAPTCHA if I make a lot of requests in a small time frame.
The result from the above code is:
<html><head><META NAME="robots" CONTENT="noindex,nofollow"><script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05"></script><script>(function() { var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D313539373232303738303038363836383835372C31313637303136303238393537363439373530392C373430383533373634363033313237303235322C353332303936222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();</script></head><body><iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe></body></html>
I just want the response I get in my browser after visiting
https://www.unocoin.com/trade?all
The website is protected by an anti-scraping service called Incapsula, which serves a small JavaScript challenge. Since you are making a plain request from Java, that script never runs, so you won't get past it unless you drive a real browser (for example with Selenium) or embed a JavaScript engine like V8. Even that is not really recommended, because you would be working around something the site has deliberately put in the way of automated access. My recommendation:
Talk with the people at unocoin.com and ask them to whitelist your IP if they are okay with you scraping their site.
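If they are fine with it and you do go the Selenium route, the idea looks roughly like this. A sketch only, assuming selenium-java and a local ChromeDriver are installed; the real browser runs the Incapsula JavaScript, after which getPageSource() returns the page a normal visitor would see:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class UnocoinSeleniumSketch {

    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.unocoin.com/trade?all");
            Thread.sleep(5000); // crude wait for the challenge/redirect to finish
            String html = driver.getPageSource();
            System.out.println(html);
        } finally {
            driver.quit();
        }
    }
}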
Instead of using the API, you can do it by scraping the Unocoin Ticker API All Rates webpage. This will break if the website changes, but until then it works.
It can be implemented via WebKit, using WKWebView and the WKNavigationDelegate protocol, and then injecting some JavaScript.
import UIKit
import WebKit

class ViewController: UIViewController, WKNavigationDelegate {

    @IBOutlet weak var webView: WKWebView!

    override func viewDidLoad() {
        super.viewDidLoad()
        webView.isHidden = true
        webView.navigationDelegate = self
        let myURL = URL(string: "https://www.unocoin.com/trade?all")
        let myRequest = URLRequest(url: myURL!)
        webView.load(myRequest)
    }

    // Called when the website has finished loading
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        // Injecting JS to fetch the HTML inside <body>
        webView.evaluateJavaScript("document.body.innerHTML", completionHandler: {
            (html: Any?, error: Error?) in
            if error == nil && html != nil {
                // Perform string manipulation and parse JSON to get the data
            } else {
                // Error while fetching data
            }
        })
    }
}
How can I get the latest tweet from HTML content, either with regex or without any external libraries? I am happy to use external libraries, I would just prefer not to; I mainly want to know how it would be possible. I have written the HTML download part in Java and, if anyone wants, I will post it here.
I'll write a bit of pseudocode so that I'm not only targeting Java developers. This is how my program looks so far:
1.) Load site("www.twitter.com/user123")
2.) Get initial string and write it to variable -> buffer
3.) Loop start
4.)   Append string -> buffer
5.)   If there is no more -> break
6.) Print buffer
Obviously the buffer variable will now hold raw HTML content. How can I sort through it to get the tweet? I have found a way, but it is too inconsistent. What I did was find the string that surrounds the tweets and grab the content inside it, but there were too many changes in that section; some of its content, like the font size, changes. I could write multiple if statements, but is there a neater solution?
Let me just start off by saying that jsoup is an amazing, lightweight HTML parsing library. You can use things like CSS selectors and whatnot; if you ever decide to use a library, jsoup will make your life a lot easier.
You can just query for the element with the class TweetTextSize, then get its text content. This will give you all the text, hashtags and links. (The downside being that pictures are also given as links.)
Otherwise, you'll need to traverse the DOM manually: for example, use regex to find the beginning of the first TweetTextSize, and then keep all text which is not between a < and a >.
Unfortunately, this second solution is volatile and may not work in the future, and you'll end up with a big glob of code which is overly complex and hard to debug.
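With jsoup the class-selector approach is only a few lines. A sketch, assuming Twitter's markup still puts the TweetTextSize class on the tweet paragraphs:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TweetTextSketch {

    public static void main(String[] args) throws Exception {
        // Fetch the profile page and print the text of every tweet paragraph.
        Document doc = Jsoup.connect("https://twitter.com/user123")
                .userAgent("Mozilla/5.0")
                .get();
        for (Element tweet : doc.select("p.TweetTextSize")) {
            System.out.println(tweet.text());
        }
    }
}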
A simple answer, if you want a regex and not a sophisticated third-party library:
<p[^>]+js-tweet-text[^>]*>(.*)</p>
Try the above on the "view-source" of https://twitter.com/a
Thanks.
EDIT:
Source Code:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetSucker {

    public static void main(String[] args) throws Exception {
        URLConnection urlConnection = new URL("https://twitter.com/a").openConnection();
        InputStream inputStream = urlConnection.getInputStream();
        String encoding = urlConnection.getContentEncoding();

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len = 0;
        while ((len = inputStream.read(buffer)) != -1) {
            byteArrayOutputStream.write(buffer, 0, len);
        }

        String htmlContent = null;
        if (encoding != null) {
            htmlContent = new String(byteArrayOutputStream.toByteArray(), encoding);
        } else {
            htmlContent = new String(byteArrayOutputStream.toByteArray());
        }

        Pattern TWEET_PATTERN = Pattern.compile("(<p[^>]+js-tweet-text[^>]*>(.*)</p>)", Pattern.CASE_INSENSITIVE);
        Matcher matcher = TWEET_PATTERN.matcher(htmlContent);
        while (matcher.find()) {
            System.out.println("Tweet Found: " + matcher.group(2));
        }
    }
}
I know that you don't want any libraries, but if you want something really quick, this is working code in C#:
using (IE browser = new IE())
{
    browser.GoTo("https://twitter.com/user");
    List tweets = browser.List(Find.ById("stream-items-id"));
    if (tweets != null)
    {
        foreach (var tweet in tweets.ListItems)
        {
            var tweetText = tweet.Paras.FirstOrDefault();
            if (tweetText != null)
            {
                MessageBox.Show(tweetText.Text);
            }
        }
    }
}
This program uses a library called WatiN. If you use Visual Studio, go to the Tools menu, select "NuGet Package Manager", then "Manage NuGet Packages for Solution", then "Browse", and type "WatiN" in the search box; after you find the library hit "Install". Once it is installed, you just add a reference in your code and a using statement:
using WatiN.Core;
You can copy and paste the code I wrote above into a button handler and it will work; you just need to change the twitter.com/XXXXXX user name to list that user's tweets. Modify the code accordingly to meet your needs.
I'm building a simple news reader app and I am using HTMLCleaner to retrieve and parse the data. I've successfully gotten the data I need using the command-line version of HTMLCleaner and using xmllint, for example:
java -jar htmlcleaner-2.6.jar src=http://www.reuters.com/home nodebyxpath=//div[@id=\"topStory\"]
and
curl www.reuters.com | xmllint --html --xpath //div[@id='"topStory"'] -
both return the data I want. But when I try to make this request using HTMLCleaner in my code I get no results. Even more troubling, even basic queries like //div only return 8 nodes in my app, while the command line reports 70+, which is correct.
Here is the code I have now. It is in an Android class extending AsyncTask, so it runs in the background. The final code will actually extract the text data I need, but right now I'm having trouble just getting it to return a result; when I log the title node, the node count is zero.
I've tried every manner of escaping the XPath query strings, but it makes no difference.
The HTMLCleaner code is in a separate source folder in my project and is (at least I think) compiled to Dalvik with the rest of my app, so an incompatible jar file shouldn't be the problem.
I've tried to dump the HTMLCleaner output, but it doesn't work well with LogCat, and a lot of the page markup is missing when I dump it, which made me think HTMLCleaner was parsing incorrectly and discarding most of the page. But how can that be the case when the command-line version works fine?
Also, the app does not crash and I'm not logging any exceptions.
protected Void doInBackground(URL... argv) {
    final HtmlCleaner cleaner = new HtmlCleaner();
    TagNode lNode = null;

    try {
        lNode = cleaner.clean( argv[0].openConnection().getInputStream() );
        Log.d("LoadMain", argv[0].toString());
    } catch (IOException e) {
        Log.d("LoadMain", e.getMessage());
    }

    final String lTitle = "//div[@id=\"topStory\"]";
    // final String lBlurp = "//div[@id=\"topStory\"]//p";

    try {
        Object[] x = lNode.evaluateXPath(lTitle);
        // Object[] y = lNode.evaluateXPath(lBlurp);
        Log.d("LoadMain", "Title Nodes: " + x.length);
        // Log.d("LoadMain", "Title Nodes: " + y.length);
        // this.mBlurbs.add(new BlurbView(this.mContext, x.getText().toString(), y.getText().toString()));
    } catch (XPatherException e) {
        Log.d("LoadMain", e.getMessage());
    }

    return null;
}
Any help is greatly appreciated. Thank you.
UPDATE:
I've narrowed the problem down to something to do with the HTTP request. If I load the HTML source as an asset I get what I want, so clearly the problem is in the HTTP request itself. In other words, using lNode = cleaner.clean( getAssets().open("reuters.html") ); works fine.
The problem was that the HTTP request was being redirected to the mobile website. This was solved by changing the User-Agent property, like so:
private static final String USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0";
HttpURLConnection lConn = (HttpURLConnection) argv[0].openConnection();
lConn.setRequestProperty("User-Agent", USER_AGENT);
lConn.connect();
lNode = cleaner.clean( lConn.getInputStream() );