How can I download a webpage that uses a JavaScript-based loading mechanism?
The code below returns a nearly empty document because of the site's loading mechanism.
When viewed in a browser you see "loading..." and after a while the content is presented.
Also, I want to avoid using the WebBrowser control.
HtmlDocument doc = new HtmlDocument();
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
if (!string.IsNullOrWhiteSpace(userAgent))
req.UserAgent = userAgent;
if (cookies != null)
{
req.CookieContainer = new CookieContainer();
foreach (Cookie c in cookies)
req.CookieContainer.Add(c);
}
var resp = req.GetResponse();
var resp_str = resp.GetResponseStream();
using (StreamReader sr = new StreamReader(resp_str, Encoding.GetEncoding("windows-1251")))
{
string r = sr.ReadToEnd();
doc.LoadHtml(r);
}
return doc;
Well, you basically need a web browser to run the JavaScript. Your web request only fetches the data, as is, from the server.
You could use System.Windows.Forms.WebBrowser, but it's not pretty. This answer, https://stackoverflow.com/a/11394830/2940949, might give you some idea of the basic issue.
I'm a newbie to HtmlUnit, and I'm writing a demo script to load the source HTML of a webpage and write it to a txt file.
public static void main(String[] args) throws IOException {
try (final WebClient wc = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
wc.getOptions().setThrowExceptionOnScriptError(false);
final HtmlPage page = wc.getPage("https://www.sainsburys.co.uk/gol-ui/SearchResults/biscuits");
WebResponse res = page.getWebResponse();
String html = res.getContentAsString();
FileWriter fw = new FileWriter(dir + "/pageHtml.txt");
fw.write(html);
fw.close();
}
}
However, it returns the HTML for disabled JavaScript. To try and fix this, I added this line to ensure JS is enabled on the WebClient:
wc.getOptions().setJavaScriptEnabled(true);
Despite that, nothing changes. Am I being an idiot, or is there something more subtle that needs to change?
Thanks for any help! ^_^
WebResponse res = page.getWebResponse();
String html = res.getContentAsString();
This is the response (the code) you got from the server. If you would like the current DOM (the one after the JavaScript processing is done), you can do something like
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(60_000);
System.out.println(page.asXml());
or
System.out.println(page.asNormalizedText());
First of all, I am sorry if I am wrong in calling the response fake JSON.
The API I am using is the ticker API of Unocoin:
https://www.unocoin.com/trade?all
I have been working on a website that takes the rates from various Indian bitcoin exchanges and plots graphs for easy visualization. So far I have added 3 exchanges and got their rates from their ticker APIs; the response is just plain text with no surprises.
All these exchanges, like
ZEBPAY: https://www.zebapi.com/api/v1/market/ticker/btc/inr
Koinex: https://koinex.in/api/ticker
made my life easier, but
a GET request to the Unocoin API returns an HTML page with only an iframe in the body tag, and I cannot use the data in my code, directly or indirectly.
There is an alternative method that gives access to many features, but it requires me to register and send my access token with every request, which I would rather not do right now.
To make the API calls I am using Java; the code is given below:
private static String sendGet(String host,String apiEndpoint) throws Exception {
URL obj = new URL(host+apiEndpoint);
HttpURLConnection con = (HttpURLConnection) obj.openConnection();
// optional default is GET
con.setRequestMethod("GET");
//add request header
con.setRequestProperty("User-Agent", USER_AGENT);
int responseCode = con.getResponseCode();
System.out.println(responseCode);
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()));
String inputLine;
StringBuffer response = new StringBuffer();
while ((inputLine = in.readLine()) != null) {
response.append(inputLine);
}
in.close();
return(response.toString());
}
Just a note: I get a Google reCAPTCHA if I make a lot of requests in a short time frame.
The result from the above code is
<html><head><META NAME="robots" CONTENT="noindex,nofollow"><script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05"></script><script>(function() { var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F5357484
14E45444C3D313539373232303738303038363836383835372C31313637303136303238393537363439373530392C373430383533373634363033313237303235322C353332303936222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();</script></head><body><iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe></body></html>
I just want the same response that I get in my browser after visiting
https://www.unocoin.com/trade?all
The website is protected by an anti-scraping service called Incapsula, which tries to run a small piece of JavaScript in the client. A plain Java HTTP client cannot run it; you would need something like Selenium or an embedded JavaScript engine such as V8. I would not recommend that, though, because you would be working around a measure the site put in place against exactly this kind of access. My recommendation:
Talk to the people at unocoin.com and ask them to whitelist your IP, if they are okay with you scraping their site.
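To see why the plain HTTP client gets nothing useful, it can help to read what the script in the response actually contains. The page stores its JavaScript as a string of hex byte pairs and rebuilds it with parseInt(..., 16) and String.fromCharCode before calling eval. A minimal sketch of the same decoding step in Java follows; HexPayloadDecoder and decodeHex are hypothetical names for this illustration, not part of Incapsula or of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only: it mirrors the hex-to-text step
// that the inline script in the response performs before calling eval().
final class HexPayloadDecoder {

    // Turns "7472..." into text, two hex digits per character.
    static String decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < hex.length(); i += 2) {
            bytes[i / 2] = (byte) Integer.parseInt(hex.substring(i, i + 2), 16);
        }
        // String.fromCharCode maps each value 0..255 to the code point with
        // the same number, which is what ISO-8859-1 decoding does.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // First 12 bytes of the payload shown in the question.
        System.out.println(decodeHex("7472797B766172207868723B")); // prints try{var xhr;
    }
}
```

Decoding the payload only shows what the challenge does (it issues a follow-up XMLHttpRequest and reloads the page); it does not make a Java client pass the check.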
Instead of using the API, you can do it by scraping the Unocoin Ticker API All Rates webpage. This would break if there is some change in the website, but till then it works.
It can be implemented via WebKit using WKWebView, WKNavigationDelegate protocol and then injecting some JavaScript.
import UIKit
import WebKit
class ViewController: UIViewController, WKNavigationDelegate {
@IBOutlet weak var webView: WKWebView!
override func viewDidLoad() {
super.viewDidLoad()
webView.isHidden = true
webView.navigationDelegate = self
let myURL = URL(string: "https://www.unocoin.com/trade?all")
let myRequest = URLRequest(url: myURL!)
webView.load(myRequest)
}
// For checking if website has loaded
func webView(_ webView: WKWebView, didFinish navigation: WKNavigation!) {
// Injecting JS to fetch HTML inside <body>
webView.evaluateJavaScript("document.body.innerHTML", completionHandler: {
(html: Any?, error: Error?) in
if error == nil && html != nil {
// Perform string manipulation and parse JSON to get data
} else {
// Error while fetching data
}
})
}
}
I'm unable to save a Data URI in JSP. I am trying like this, is there any mistake in the following code?
<%@ page import="java.awt.image.*,java.io.*,javax.imageio.*,sun.misc.*" %>
function save_photo()
{
Webcam.snap(function(data_uri)
{
document.getElementById('results').innerHTML =
'<h2>Here is your image:</h2>' + '<img src="'+data_uri+'"/>';
var dat = data_uri;
<%
String st = "document.writeln(dat)";
BufferedImage image = null;
byte[] imageByte;
BASE64Decoder decoder = new BASE64Decoder();
imageByte = decoder.decodeBuffer(st);
ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
image = ImageIO.read(bis);
bis.close();
if (image != null)
ImageIO.write(image, "jpg", new File("d://1.jpg"));
out.println("value=" + st); // here it displays the base64 characters
System.out.println("value=" + st); // but here it displays document.writeln(dat)
%>
}
}
Finally, the image is not saved.
I think you didn't get the difference between JSP and JavaScript. JSP is executed on the server at the time your browser requests the web page, while JavaScript is executed on the client side, that is, in your browser, when you do an interaction that causes the JavaScript to run.
Your server (e.g. Apache Tomcat) will first execute your JSP code:
String st = "document.writeln(dat)";
BufferedImage image = null;
byte[] imageByte;
BASE64Decoder decoder = new BASE64Decoder();
imageByte = decoder.decodeBuffer(st);
ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
image = ImageIO.read(bis);
bis.close();
if (image != null)
ImageIO.write(image, "jpg", new File("d://1.jpg"));
out.println("value=" + st);
System.out.println("value=" + st);
As you can see, nowhere is the value of st changed. Your browser will receive the following snippet from your server:
value=document.writeln(dat);
Since your browser is the one that executes JavaScript, it will execute it and show the Base64-encoded image, but your server won't.
For the exact difference, read this article.
To make the code work, the easiest way is to redirect the page:
function(data_uri)
{
// redirect
document.location.href = 'saveImage.jsp?img=' + encodeURIComponent(data_uri);
}
Now you can have a JSP page called saveImage.jsp that saves the image, returns the web page you had already, and writes the data_uri into the results element.
Another, but more difficult, way is to use AJAX. Here is an introduction to it.
You are trying to use JavaScript variables in Java code. Java code runs on your server, while JavaScript code runs in the user's browser. By the time the JavaScript code executes, your Java code has already been executed. Whatever you're trying to do, you have to do it in pure JavaScript, or send an AJAX call to your server when your JavaScript code has done its thing.
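Once the data URI actually reaches the server (as a request parameter or in an AJAX request body), the decoding step from the question can be done with java.util.Base64 instead of sun.misc.BASE64Decoder, which is an internal class and is not available in current JDKs. The following is a minimal sketch; DataUriDecoder and its methods are hypothetical names, and it assumes the data URI is Base64-encoded, as the webcam snapshot in the question appears to be.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for illustration: decodes "data:<type>;base64,<payload>".
final class DataUriDecoder {

    static byte[] decode(String dataUri) {
        int comma = dataUri.indexOf(',');
        if (!dataUri.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("not a data URI");
        }
        String header = dataUri.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("only base64 data URIs are handled here");
        }
        // If the value travelled through an unescaped query string, '+' may
        // have been turned into a space; undo that before decoding.
        String payload = dataUri.substring(comma + 1).replace(' ', '+');
        return Base64.getDecoder().decode(payload);
    }

    // Writes the decoded bytes unchanged; the caller chooses the target path.
    static void save(String dataUri, Path target) throws IOException {
        Files.write(target, decode(dataUri));
    }

    public static void main(String[] args) {
        byte[] bytes = decode("data:text/plain;base64,SGVsbG8=");
        System.out.println(new String(bytes)); // prints Hello
    }
}
```

In a servlet or in saveImage.jsp this would be called with the value of the request parameter. Image data URIs can be long, so sending them in a POST body rather than in the URL is probably the safer choice.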
I am using a JTextPane to display data from a webpage that isn't mine, so I have no control over its contents. It requires a user to be logged in, so I use URLConnections to connect to that webpage and use cookies in the URLConnection to retrieve data. That works fine. However, when I put this data in a JTextPane with the content type set to text/html, the images do not display as they require those cookies with the session id and stuff to be sent in order to retrieve the uploaded images.
Is there any way I can make the JTextPane (though I am able to use anything else in the jdk that displays html) use my cookies?
Thanks.
I store the cookies in a linked list:
loadText = "Logging in...";
url = new URL("http://www.example.com/login.php");
connection = url.openConnection();
connection.setDoOutput(true);
OutputStreamWriter out = new OutputStreamWriter(
connection.getOutputStream());
out.write("username=" + URLEncoder.encode(username, "UTF-8")
+ "&password=" + URLEncoder.encode(password, "UTF-8")
+ "&testcookies=1");
out.flush();
out.close();
List<String> cookies = new LinkedList<String>();
for (int i = 1; (headerName = connection.getHeaderFieldKey(i)) != null; i++) {
if (headerName.equals("Set-Cookie")) {
String cookie = connection.getHeaderField(i);
cookie = cookie.substring(0, cookie.indexOf(";"));
cookies.add(cookie);
}
}
And I also need to strip away unnecessary HTML, which gives me a string I plug into the text pane:
String p1 = rawPage.split("<div id=\"contentstart\">")[1]
.split("</div><!--id='contentstart'-->")[0];
p1 = p1.replaceAll("<p><strong></strong></p>", "");
p1 = p1.replaceAll("<p></p>", "");
parsed = true;
JTextPane tp = new JTextPane();
tp.setEditable(false);
JScrollPane js = new JScrollPane();
js.getViewport().add(tp);
js.setHorizontalScrollBarPolicy(ScrollPaneConstants.HORIZONTAL_SCROLLBAR_NEVER);
getContentPane().add(js);
js.setSize(640, 480);
tp.setContentType("text/html");
tp.setText(p1);
Are you not reading the content from URLConnection? Something like this may help.
Post your code so that we can get more insight.
JTextPane pane;
..
HTMLDocument htmlDocument = (HTMLDocument) pane.getDocument();
htmlDocument.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
htmlDocument.putProperty(Document.StreamDescriptionProperty, pageUrl);
pane.read(connection.getInputStream(), htmlDocument);
-- or --
You may try the browser swing component instead of JTextPane.
http://djproject.sourceforge.net/ns/index.html
Cookies are stored in relation to your browser. For example, if you have some cookies in Firefox, Microsoft IE can't see those cookies. Similarly, the cookies you have obtained from the webpage you're looking for are not available to your Java application.
But also, JTextPane is not a full-featured HTML browser. You can use it to render basic HTML (actually HTML 2.0, a much older version of HTML), but it won't work with cookies, CSS, and other now-standard web features.
You may want to look at full-featured web browsers, such as Flying Saucer - see http://weblogs.java.net/blog/2007/07/14/flying-saucer-r7-out
But even if you do this, Flying Saucer won't see the cookies that you've obtained through other browsers.
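If the cookies are obtained inside the Java application itself, as in the question, one option that may be worth trying is a JVM-wide java.net.CookieManager. HttpURLConnection consults the default CookieHandler, so cookies received at login are stored automatically and sent again on later requests to the same site. The sketch below assumes that the image requests made by the HTML component also go through java.net.URL, which I have not verified for JTextPane on every JDK; SharedCookies, its methods, and the cookie value are hypothetical names for this illustration.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration: one cookie store for the whole JVM.
final class SharedCookies {

    // Call once, before the login request is made.
    static CookieManager install() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);
        return manager;
    }

    // Simulates a login response, then asks which Cookie header a later
    // request for an image on the same host would carry.
    static String demo() {
        CookieManager manager = install();
        try {
            Map<String, List<String>> loginResponse = Collections.singletonMap(
                    "Set-Cookie", Collections.singletonList("PHPSESSID=abc123; Path=/"));
            manager.put(URI.create("http://www.example.com/login.php"), loginResponse);

            Map<String, List<String>> outgoing = manager.get(
                    URI.create("http://www.example.com/images/a.png"),
                    new HashMap<String, List<String>>());
            List<String> cookies = outgoing.get("Cookie");
            return cookies == null ? "" : String.join("; ", cookies);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the manager installed, the manual Set-Cookie loop from the question should no longer be needed for requests made through URLConnection.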
I'm trying to get data from a website which is encoded in UTF-8 and insert it into a MySQL database. The database is also encoded in UTF-8.
This is the method I use to download data from specific site.
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s, "UTF-8");
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252"); ) everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things, such as links. What does this mean?
I would consider using commons-io; it has a function that does what you want to do.
That is replace your code with this:
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
String content = null;
try {
s = (java.io.InputStream)new URL(url).getContent();
content = IOUtils.toString(s, "UTF-8");
}
finally {
if (s != null) s.close();
}
return content.toString();
}
If that doesn't do it, check whether you can write the text to a file correctly, to eliminate the possibility that your database isn't set up correctly.
Java
The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a JSP page, or in the doGet or doPost of a servlet, before any content is sent to the response, just do:
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try using the utf8_encode function after retrieving the data from the database.
Is your database encoding set to UTF-8 for the server, client, and connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding during display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
1. Write the text to the HTTP response output using the same encoding, thus UTF-8.
2. Set the charset in the content type to UTF-8 so that the web browser knows which encoding to use to display the text.
As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
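Both symptoms from the question can be reproduced in a few lines of Java, which may help in working out which step uses the wrong encoding. The sketch below is only an illustration (MojibakeDemo and its method names are hypothetical): reading UTF-8 bytes as windows-1252 yields the two-character sequence seen on the Java side, and reading single-byte Latin-1 data as UTF-8 yields the replacement character seen on the web page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class MojibakeDemo {

    // "Côte d'Ivoire", written with an escape so the source file encoding
    // cannot interfere with the demonstration.
    static final String ORIGINAL = "C\u00F4te d'Ivoire";

    // UTF-8 bytes decoded with the wrong single-byte charset:
    // the two bytes of 'ô' become two separate characters.
    static String utf8ReadAsWindows1252() {
        byte[] utf8 = ORIGINAL.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Single-byte data decoded as UTF-8: the lone byte for 'ô' is not valid
    // UTF-8, so the decoder substitutes U+FFFD (shown as a question-mark box).
    static String latin1ReadAsUtf8() {
        byte[] latin1 = ORIGINAL.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // Same charset on both sides: the text survives unchanged.
    static String utf8RoundTrip() {
        return new String(ORIGINAL.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8ReadAsWindows1252());
        System.out.println(latin1ReadAsUtf8());
        System.out.println(utf8RoundTrip().equals(ORIGINAL)); // prints true
    }
}
```

If the page shows the replacement character while the stored data is correct UTF-8, the likely culprit is a step between the database and the browser that converts to or declares a different charset, which matches the advice above.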