HTML Cleaner + XPath Not Working in Android App - java

I'm building a simple news reader app and I am using HtmlCleaner to retrieve and parse the data. I've successfully gotten the data I need using the command-line version of HtmlCleaner and using xmllint, for example:
java -jar htmlcleaner-2.6.jar src=http://www.reuters.com/home nodebyxpath=//div[@id=\"topStory\"]
and
curl www.reuters.com | xmllint --html --xpath '//div[@id="topStory"]' -
both return the data I want. But when I make the same request with HtmlCleaner in my code, I get no results. Even more troubling, even a basic query like //div returns only 8 nodes in my app, while the command line reports 70+, which is correct.
Here is the code I have now. It is in an Android class extending AsyncTask, so it's performed in the background. The final code will actually extract the text data I need, but I'm having trouble just getting it to return a result. When I log the title nodes, the node count is zero.
I've tried every manner of escaping the XPath query strings, but it makes no difference.
The HtmlCleaner code is in a separate source folder in my project and is (at least I think) compiled to Dalvik with the rest of my app, so an incompatible jar file shouldn't be the problem.
I've tried dumping the HtmlCleaner output, but it doesn't work well with LogCat, and a lot of the page markup is missing when I dump it. That made me think HtmlCleaner was parsing incorrectly and discarding most of the page, but how can that be the case when the command-line version works fine?
Also, the app does not crash and I'm not logging any exceptions.
protected Void doInBackground(URL... argv) {
    final HtmlCleaner cleaner = new HtmlCleaner();
    TagNode lNode = null;
    try {
        lNode = cleaner.clean( argv[0].openConnection().getInputStream() );
        Log.d("LoadMain", argv[0].toString());
    } catch (IOException e) {
        Log.d("LoadMain", e.getMessage());
    }
    final String lTitle = "//div[@id=\"topStory\"]";
    // final String lBlurp = "//div[@id=\"topStory\"]//p";
    try {
        Object[] x = lNode.evaluateXPath(lTitle);
        // Object[] y = lNode.evaluateXPath(lBlurp);
        Log.d("LoadMain", "Title Nodes: " + x.length );
        // Log.d("LoadMain", "Title Nodes: " + y.length);
        // this.mBlurbs.add(new BlurbView (this.mContext, x.getText().toString(), y.getText().toString() ));
    } catch (XPatherException e) {
        Log.d("LoadMain", e.getMessage());
    }
    return null;
}
Any help is greatly appreciated. Thank you.
UPDATE:
I've narrowed the problem down to something to do with the HTTP request. If I load the HTML source as an asset I get what I want, so clearly the problem is in receiving the HTTP response. In other words, using lNode = cleaner.clean( getAssets().open("reuters.html") ); works fine.

The problem was that the HTTP request was being redirected to the mobile website. This was solved by changing the User-Agent property, like so:
private static final String USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0";
HttpURLConnection lConn = (HttpURLConnection) argv[0].openConnection();
lConn.setRequestProperty("User-Agent", USER_AGENT);
lConn.connect();
lNode = cleaner.clean( lConn.getInputStream() );
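
Putting it together, here is a sketch of the doInBackground method from the question with the User-Agent fix applied (same names and log tag as above; nothing new beyond the header):
protected Void doInBackground(URL... argv) {
    final HtmlCleaner cleaner = new HtmlCleaner();
    try {
        // Identify as a desktop browser so the server serves the full site, not the mobile one
        HttpURLConnection lConn = (HttpURLConnection) argv[0].openConnection();
        lConn.setRequestProperty("User-Agent", USER_AGENT);
        lConn.connect();
        TagNode lNode = cleaner.clean(lConn.getInputStream());
        // Evaluate the XPath against the cleaned document
        Object[] x = lNode.evaluateXPath("//div[@id=\"topStory\"]");
        Log.d("LoadMain", "Title Nodes: " + x.length);
    } catch (IOException e) {
        Log.d("LoadMain", e.getMessage());
    } catch (XPatherException e) {
        Log.d("LoadMain", e.getMessage());
    }
    return null;
}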

Related

how to read FAKE JSON response from ticker API of unocoin bitcoin exchange?

First of all, I am sorry if I am wrong that the response is fake JSON.
The API I am using is the ticker API of Unocoin:
https://www.unocoin.com/trade?all
I have been working on a website that takes the rates from various Indian bitcoin exchanges and plots graphs for easy visualization. So far I have added 3 exchanges and got their rates from their ticker APIs; the responses I got were just plain text and no other surprises.
All these exchanges, like
ZEBPAY: https://www.zebapi.com/api/v1/market/ticker/btc/inr
Koinex: https://koinex.in/api/ticker
made my life easier, but making a GET request to the Unocoin API gives me an HTML page with only an iframe in the body tag, and I am not able to directly (or indirectly) use the data in my code.
There is an alternate method to get access to many features, but it requires me to register and send my ACCESS TOKEN in every request, which I would rather not do right now.
To make the API calls I am using Java; the code is given below:
private static String sendGet(String host, String apiEndpoint) throws Exception {
    URL obj = new URL(host + apiEndpoint);
    HttpURLConnection con = (HttpURLConnection) obj.openConnection();
    // optional, default is GET
    con.setRequestMethod("GET");
    // add request header
    con.setRequestProperty("User-Agent", USER_AGENT);
    int responseCode = con.getResponseCode();
    System.out.println(responseCode);
    BufferedReader in = new BufferedReader(
            new InputStreamReader(con.getInputStream()));
    String inputLine;
    StringBuffer response = new StringBuffer();
    while ((inputLine = in.readLine()) != null) {
        response.append(inputLine);
    }
    in.close();
    return response.toString();
}
Just a note: I get a Google reCAPTCHA if I make a lot of requests in a small time frame.
The result from the above code is:
<html><head><META NAME="robots" CONTENT="noindex,nofollow"><script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05"></script><script>(function() { var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D313539373232303738303038363836383835372C31313637303136303238393537363439373530392C373430383533373634363033313237303235322C353332303936222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();</script></head><body><iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe></body></html>
I just want the response, just like what I get in my browser after visiting
https://www.unocoin.com/trade?all
The website is protected by an anti-scraping service called Incapsula, which serves a small JavaScript challenge before the real content. Since you are calling it from plain Java, that script never runs, so you never get past the challenge. You could get around it with Selenium or by embedding a JavaScript engine like V8, but that is not really recommended, since you would be working around behavior the site explicitly considers intrusive. My recommendation:
Talk with the guys from unocoin.com and ask them to whitelist your IP if they are okay with you scraping their site.
Instead of using the API, you can do it by scraping the Unocoin Ticker API All Rates webpage. This would break if there is some change in the website, but till then it works.
It can be implemented via WebKit using WKWebView, WKNavigationDelegate protocol and then injecting some JavaScript.
import UIKit
import WebKit

class ViewController: UIViewController, WKNavigationDelegate {

    @IBOutlet weak var webView: WKWebView!

    override func viewDidLoad() {
        super.viewDidLoad()
        webView.isHidden = true
        webView.navigationDelegate = self
        let myURL = URL(string: "https://www.unocoin.com/trade?all")
        let myRequest = URLRequest(url: myURL!)
        webView.load(myRequest)
    }

    // For checking if the website has loaded
    func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
        // Injecting JS to fetch the HTML inside <body>
        webView.evaluateJavaScript("document.body.innerHTML", completionHandler: {
            (html: Any?, error: Error?) in
            if error == nil && html != nil {
                // Perform string manipulation and parse JSON to get the data
            } else {
                // Error while fetching data
            }
        })
    }
}

Check for validity of URL in Java so as not to crash on 404 error

Essentially, like a bulletproof tank, I want my program to absorb 404 errors and keep on rolling, crushing the interwebs and leaving corpses dead and bloodied in its wake, or whatever.
I keep getting this error:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:29)
at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:38)
at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
at Q.Runner.main(Runner.java:35)
But I can't understand why because I am checking to see if I have a valid URL before I navigate to it. What about my checking procedure is incorrect?
I tried to examine the other Stack Overflow questions on this subject but they're not very authoritative, plus I implemented many of the solutions from this one and this one; so far nothing has worked.
I'm using the Apache Commons UrlValidator; this is the code I've been using most recently:
// get its normal wiki disambig page
String URL_check = "https://en.wikipedia.org/wiki/" + associated_alias;
UrlValidator urlValidator = new UrlValidator();
if ( urlValidator.isValid( URL_check ) )
{
    Document docx = Jsoup.connect( URL_check ).get();
    // this can handle the less structured ones.
and
// check the validity of the URL
String URL_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=Search";
UrlValidator urlValidator = new UrlValidator();
if ( urlValidator.isValid( URL_czech ) )
{
    URL wikidata_page = new URL( URL_czech );
    URLConnection wiki_connection = wikidata_page.openConnection();
    BufferedReader wiki_data_pagecontent = new BufferedReader(
            new InputStreamReader(
                    wiki_connection.getInputStream()));
The URLConnection throws an error when the status code of the webpage you're downloading is anything other than 2xx (such as 200 or 201, etc.). Instead of passing Jsoup a URL or String to parse your document, consider passing it an input stream of data which contains the webpage.
Using the HttpURLConnection class, we can try to download the webpage using getInputStream(), place that in a try/catch block, and if it fails attempt to download it via getErrorStream().
Consider this bit of code, which will download your wiki page even if it returns 404:
String URL_czech = "https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29";
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection) wikidata_page.openConnection();
InputStream wikiInputStream = null;

try {
    // try to connect and use the input stream
    wiki_connection.connect();
    wikiInputStream = wiki_connection.getInputStream();
} catch (IOException e) {
    // failed, try using the error stream
    wikiInputStream = wiki_connection.getErrorStream();
}

// parse the input stream using Jsoup
Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol() + "://" + wikidata_page.getHost() + "/");
The Status=404 error means there's no page at that location. Just because a URL is valid doesn't mean there's anything there. A validator can't tell you that. The only way you can determine that is by fetching it, and seeing if you get an error, as you're doing.
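If you would rather stay with Jsoup instead of dropping down to HttpURLConnection, one option is to execute the request with ignoreHttpErrors(true) and check the status code yourself. A minimal sketch (the URL is the one from the stack trace):
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SafeFetch {
    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29";
        // Don't throw HttpStatusException on 404; inspect the status ourselves instead
        Connection.Response response = Jsoup.connect(url)
                .ignoreHttpErrors(true)
                .execute();
        if (response.statusCode() == 200) {
            Document doc = response.parse();
            System.out.println(doc.title());
        } else {
            System.out.println("Skipping " + url + " (HTTP " + response.statusCode() + ")");
        }
    }
}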

Java Applet's URLConnection to PHP has no effect

I've studied up on the Oracle documentation and examples and still can't get this to work.
I have a Java Applet that is simply trying to send a text field to a PHP script via POST, using URLConnection and OutputStreamWriter. The Java side seems to work fine, no exceptions are thrown, but PHP is not showing any output on my page. I am a PHP noob so please bear with me on that part.
Here is the relevant Java portion:
try {
    URL url = new URL("myphpfile.php");
    URLConnection con = url.openConnection();
    con.setDoOutput(true);
    out = new OutputStreamWriter(con.getOutputStream());
    String outstring = "field1=" + field1 + "&field2=" + field2;
    out.write(outstring);
    out.close();
}
catch (Exception e) {
    System.out.println("HTTPConnection error: " + e);
    return;
}
and here is the relevant PHP code:
<?php
$field1 = $_POST['field1'];
$field2 = $_POST['field2'];
print "<table><tr><th>Column1</th><th>Column2</th></tr><tr><td>" .
    $field1 . "</td><td>" . $field2 . "</td></tr></table>";
?>
All I see are the table headers Column1 and Column2 (let's just keep these names generic for testing purposes). What am I doing wrong? Do I need to tell my PHP script to check when my Java code does the write?
Don't use $_POST; use $_REQUEST or $_GET.
Where do you set $field1 and $field2 in your PHP script?
Try URL url = new URL("myphpfile.php?field1=" + field1 + "&field2=" + field2);
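If you go that route, note that with URLConnection the request is typically not sent until you read the response, and reading the response is also the only place you will see what the PHP script printed. A minimal sketch of that GET variant (the host and field values are placeholders):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class SendFields {
    public static void main(String[] args) throws Exception {
        String field1 = "foo", field2 = "bar";   // sample values
        // Pass the fields as query parameters, as suggested above
        URL url = new URL("http://example.com/myphpfile.php?field1=" + field1 + "&field2=" + field2);
        URLConnection con = url.openConnection();
        // Reading the response completes the request; this is where the PHP output ends up
        BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
        StringBuilder body = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            body.append(line);
        }
        in.close();
        System.out.println(body);
    }
}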
Well, I feel like I've tried every possible thing that can be tried with PHP, so I eventually went with JSObject. Now THAT was easy.
Working Java code:
JSObject window = JSObject.getWindow(this);
// invoke JavaScript function
String result = "<table><tr><th>Column1</th><th>Column2</th></tr><tr><td>"
+ field1 + "</td><td>" + field2 + "</td></tr></table>";
window.call("writeResult", new Object[] {result});
Relevant working Javascript:
function writeResult(result) {
    var resultElem = document.getElementById("anHTMLtagID");
    resultElem.innerHTML = result;
}
From here I can even send the results from Javascript to PHP via Ajax to do database-related actions. Yay!

Java program to download images from a website and display the file sizes

I'm creating a Java program that will read an HTML document from a URL and display the sizes of the images referenced in the code. I'm not sure how to go about achieving this, though.
I wouldn't need to actually download and save the images; I just need the sizes and the order in which they appear on the webpage.
for example:
a webpage has 3 images
<img src="dog.jpg" /> //which is 54kb
<img src="cat.jpg" /> //which is 75kb
<img src="horse.jpg"/> //which is 80kb
I would need the output of my Java program to display
54kb
75kb
80kb
Any ideas where I should start?
P.S. I'm a bit of a Java newbie.
If you're new to Java you may want to leverage an existing library to make things a bit easier. Jsoup allows you to fetch an HTML page and extract elements using CSS-style selectors.
This is just a quick and very dirty example but I think it will show how easy Jsoup can make such a task. Please note that error handling and response-code handling was omitted, I merely wanted to pass on the general idea:
Document doc = Jsoup.connect("http://stackoverflow.com/questions/14541740/java-program-to-download-images-from-a-website-and-display-the-file-sizes").get();
Elements imgElements = doc.select("img[src]");
Map<String, String> fileSizeMap = new HashMap<String, String>();
for (Element imgElement : imgElements) {
    String imgUrlString = imgElement.attr("abs:src");
    URL imgURL = new URL(imgUrlString);
    HttpURLConnection httpConnection = (HttpURLConnection) imgURL.openConnection();
    String contentLengthString = httpConnection.getHeaderField("Content-Length");
    if (contentLengthString == null)
        contentLengthString = "Unknown";
    fileSizeMap.put(imgUrlString, contentLengthString);
}
for (Map.Entry<String, String> mapEntry : fileSizeMap.entrySet()) {
    String imgFileName = mapEntry.getKey();
    System.out.println(imgFileName + " ---> " + mapEntry.getValue() + " bytes");
}
You might also consider looking at Apache HttpClient. I find it generally preferable over the raw URLConnection/HttpURLConnection approach.
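For example, a rough equivalent of the Content-Length lookup with Apache HttpClient 4.x might look like the sketch below (the image URL is a placeholder). A HEAD request keeps you from pulling down the image bytes just to learn the size:
import org.apache.http.Header;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ImageSize {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        // HEAD fetches only the headers, not the image body
        HttpHead head = new HttpHead("http://example.com/dog.jpg");
        CloseableHttpResponse response = client.execute(head);
        try {
            Header length = response.getFirstHeader("Content-Length");
            System.out.println(length != null ? length.getValue() + " bytes" : "Unknown");
        } finally {
            response.close();
            client.close();
        }
    }
}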
You should break your problem into 3 sub-problems:
Download the HTML document
Parse the HTML document and find the images
Download the images and determine their sizes
You can use regular expressions to find the img tags and get the image URLs. After that you'll need the HttpURLConnection class to get the image data and measure its size.
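A rough sketch of that regex approach follows (regex over HTML is fragile, so treat it as illustrative only; the page URL is a placeholder):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageSizes {
    public static void main(String[] args) throws Exception {
        // 1. Download the HTML document
        URL page = new URL("http://example.com/page.html");
        BufferedReader in = new BufferedReader(new InputStreamReader(page.openStream()));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();

        // 2. Find the img tags and pull out each src attribute
        Matcher m = Pattern.compile("<img[^>]+src=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            // 3. Ask the server for the size of each image via the Content-Length header
            URL imgUrl = new URL(page, m.group(1)); // resolve relative URLs against the page
            HttpURLConnection conn = (HttpURLConnection) imgUrl.openConnection();
            conn.setRequestMethod("HEAD");
            int len = conn.getContentLength();
            System.out.println(imgUrl + " -> " + (len >= 0 ? (len / 1024) + "kb" : "unknown"));
        }
    }
}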
You can do this:
try {
    URL urlConn = new URL("http://yoururl.com/cat.jpg");
    URLConnection urlC = urlConn.openConnection();
    System.out.println(urlC.getContentLength());
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}

Android: Extracting the text between two HTML tags

I need to extract the text between two HTML tags and store it in a string. An example of the HTML I want to parse is as follows:
<div id=\"swiki.2.1\"> THE TEXT I NEED </div>
I have done this in Java using the pattern (swiki\.2\.1\\\")(.*)(\/div) and getting the string I want from group $2. However, this will not work in Android. When I go to print the contents of $2, nothing appears, because the match fails.
Has anyone had a similar problem using regex in Android, or is there a better (non-regex) way to parse the HTML page in the first place? Again, this works fine in a standard Java test program. Any help would be greatly appreciated!
For HTML parsing I always use HtmlCleaner: http://htmlcleaner.sourceforge.net/
Awesome lib that works great with XPath and, of course, Android. :-)
This shows how you can download a page from a URL and parse it to get a certain attribute value (also shown in the docs):
public static String snapFromHtmlWithCookies(Context context, String xPath, String attrToSnap,
        String urlString, String cookies) throws IOException, XPatherException {

    String snap = "";

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    props.setAllowHtmlInsideAttributes(true);
    props.setAllowMultiWordAttributes(true);
    props.setRecognizeUnicodeChars(true);
    props.setOmitComments(true);

    URL url = new URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);

    // optional cookies
    connection.setRequestProperty(context.getString(R.string.cookie_prefix), cookies);
    connection.connect();

    // use the cleaner to "clean" the HTML and return it as a TagNode object
    TagNode root = cleaner.clean(new InputStreamReader(connection.getInputStream()));

    Object[] foundNodes = root.evaluateXPath(xPath);
    if (foundNodes.length > 0) {
        TagNode foundNode = (TagNode) foundNodes[0];
        snap = foundNode.getAttributeByName(attrToSnap);
    }
    return snap;
}
Just edit it for your needs. :-)
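Since the question is about the text between the tags rather than an attribute, a minimal sketch of that variant (the URL is a placeholder; the id is the one from the question):
// Sketch: extract the text inside <div id="swiki.2.1"> with HtmlCleaner
public static String textOfSwikiDiv(String urlString) throws IOException, XPatherException {
    HtmlCleaner cleaner = new HtmlCleaner();
    TagNode root = cleaner.clean(new URL(urlString));   // fetch and parse the page
    Object[] nodes = root.evaluateXPath("//div[@id='swiki.2.1']");
    if (nodes.length > 0) {
        return ((TagNode) nodes[0]).getText().toString().trim();   // "THE TEXT I NEED"
    }
    return null;
}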
