How can I get all page content?

I want to get the full page content of a website, for example: http://academic.research.microsoft.com/Author/1789765/hoang-kiem?query=hoang%20kiem
I used this code:
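// Needs java.io.*, java.net.URL and java.nio.charset.StandardCharsets
String getResults(URL source) throws IOException {
    StringBuilder sb = new StringBuilder();
    // Decode the bytes with an explicit charset (UTF-8 assumed here) instead of
    // casting each byte to char, and close the stream when done
    try (Reader in = new InputStreamReader(source.openStream(), StandardCharsets.UTF_8)) {
        char[] buffer = new char[256];
        int charsRead;
        while ((charsRead = in.read(buffer)) != -1) {
            sb.append(buffer, 0, charsRead);
        }
    }
    return sb.toString();
}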
But the result is missing some information, such as the hints about the author that the page shows in a browser.
Can you give me some advice? Thanks.

The author details are loaded by Ajax calls (open the "Net" tab in Firebug and reload the page to see them). If you want these details, you will have to load the page in an environment that executes JavaScript (i.e. a browser).

I am pretty sure this content is loaded into the page via JavaScript, and there's not really anything you can do about that when retrieving the page text from Java. You'll probably want to use a browser plugin instead (Firefox has the largest repository of add-ons).

Related

Unable to save image in JSP

I'm unable to save a data URI in JSP. This is what I am trying; is there any mistake in the following code?
<%@ page import="java.awt.image.*,java.io.*,javax.imageio.*,sun.misc.*" %>
function save_photo()
{
    Webcam.snap(function(data_uri)
    {
        document.getElementById('results').innerHTML =
            '<h2>Here is your image:</h2>' + '<img src="'+data_uri+'"/>';
        var dat = data_uri;
        <%
        String st = "document.writeln(dat)";
        BufferedImage image = null;
        byte[] imageByte;
        BASE64Decoder decoder = new BASE64Decoder();
        imageByte = decoder.decodeBuffer(st);
        ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
        image = ImageIO.read(bis);
        bis.close();
        if (image != null)
            ImageIO.write(image, "jpg", new File("d://1.jpg"));
        out.println("value=" + st); // here it displays the Base64 characters
        System.out.println("value=" + st); // but here it displays document.writeln(dat)
        %>
    });
}
Finally, the image is not saved.

I think you have missed the difference between JSP and JavaScript. JSP is executed on the server at the moment your browser requests the web page. JavaScript is executed on the client side, in your browser, when an interaction causes the script to run.
Your server (e.g. Apache Tomcat) will first execute your JSP code:
String st = "document.writeln(dat)";
BufferedImage image = null;
byte[] imageByte;
BASE64Decoder decoder = new BASE64Decoder();
imageByte = decoder.decodeBuffer(st);
ByteArrayInputStream bis = new ByteArrayInputStream(imageByte);
image = ImageIO.read(bis);
bis.close();
if (image != null)
ImageIO.write(image, "jpg", new File("d://1.jpg"));
out.println("value=" + st);
System.out.println("value=" + st);
As you can see, the value of st is never changed. Your browser will receive the following snippet from your server:
value=document.writeln(dat);
Since your browser is the one that executes JavaScript, it will run that line and show the Base64-encoded image, but your server never sees it.
For the exact difference, read this article.
The easiest way to make the code work is to redirect the page:
function(data_uri)
{
// redirect, URL-encoding the data URI so it survives as a query parameter
document.location.href = 'saveImage.jsp?img=' + encodeURIComponent(data_uri);
}
Now you can have a JSP page called saveImage.jsp that saves the image, returns the web page you already had, and writes the data_uri into the results element.
Another, more difficult, way is to use AJAX. Here is an introduction to it.

You are trying to use JavaScript variables in Java code. Java code runs on your server, while JavaScript code runs in the user's browser. By the time the JavaScript code executes, your Java code has already finished. Whatever you're trying to do, you have to do it in pure JavaScript, or send an AJAX call to your server once your JavaScript code has done its thing.
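
As a rough sketch of the server-side half: once the data URI reaches the server as a request parameter (via the redirect or an AJAX call), Java code can decode it there. The class and method names below are hypothetical and not from the original posts, and the sketch uses java.util.Base64 instead of the internal sun.misc.BASE64Decoder from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: the name DataUriDecoder is only for illustration.
final class DataUriDecoder {

    // Returns the raw bytes behind a Base64 data URI such as
    // "data:image/jpeg;base64,/9j/4AAQ...".
    static byte[] decode(String dataUri) {
        int comma = dataUri.indexOf(',');
        if (!dataUri.startsWith("data:") || comma < 0) {
            throw new IllegalArgumentException("Not a data URI");
        }
        // The header sits between "data:" and the comma, e.g. "image/jpeg;base64"
        String header = dataUri.substring("data:".length(), comma);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("Only Base64 data URIs are handled here");
        }
        // Everything after the comma is the Base64 payload
        return Base64.getDecoder().decode(dataUri.substring(comma + 1));
    }

    public static void main(String[] args) {
        // "aGVsbG8=" is the Base64 encoding of the ASCII text "hello"
        byte[] bytes = decode("data:text/plain;base64,aGVsbG8=");
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

For an image, the returned bytes could then be wrapped in a ByteArrayInputStream and passed to ImageIO.read, as the question's code attempts to do.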

Parsing PDF that has been downloaded from internet

I have searched for questions about this topic on Stack Overflow. They really helped me, but I am stuck again.
My problem is that I need to write a method that downloads a PDF from a site (like www.example.com/abc.pdf) and then reads its content. I don't want to save the file, just print it to System.out, so I don't need to write the bytes to a FileOutputStream. I tried casting the bytes to char to get characters (it may be the dumbest solution), but I got unknown characters. Any ideas, or have I misunderstood something?
Here is the code and its output:
String textlink="http://www.selab.isti.cnr.it/ws-mate/example.pdf"; // it comes from main class
public String HtmlTest(String textLink) throws IOException{
    StringBuilder sd=new StringBuilder();
    URL link=new URL(textLink);
    URLConnection urlConn = link.openConnection();
    BufferedInputStream in = null;
    try
    {
        in = new BufferedInputStream(urlConn.getInputStream());
        byte data[] = new byte[1024];
        in.read(data, 0, 1024);
        for (int j = 0; j < data.length; j++) {
            if(j%100==0){
                sd.append((char)data[j]+"\n"); // I used this to make the text readable
            }
            else{
                sd.append((char)data[j]);
            }
        }
    }
    finally
    {
        if (in != null)
            in.close();
    }
    return sd.toString();
}
Output
run:
%
PDF-1.3
%ￇ￬ﾏﾢ
7 0 obj
<</Length 8 0 R/Filter /FlateDecode>>
stream
xﾜﾭY[ﾓￛﾶ￮ﾳ&?BoNf,,q%￠ﾼ4￞x&ﾞ6ﾩﾛlￓ
ﾗﾼ￐ﾽￋZeﾑ￲f￻￫￻ﾁ

You're not going to get very far trying to read a PDF file as though it were basically a text file. For starters, the "text" is stored in a compressed binary format (note the /FlateDecode filter in your output); there are other issues you'll probably also have to deal with.
Strong suggestion (IMHO): use a Java PDF library such as Apache PDFBox.
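
As a minimal sketch of the download half only: the bytes can be collected in memory, without saving a file, and then handed to a PDF library. The class and method names below are hypothetical. The loop also reads until end of stream, whereas the question's code calls read once and ignores how many bytes came back. Text extraction itself is left to the library and is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the name PdfBytes is only for illustration.
final class PdfBytes {

    // Reads the whole stream into a byte array, keeping the data binary.
    static byte[] readFully(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            int n;
            // read() may return fewer bytes than requested, so loop until -1
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // A PDF file starts with the ASCII signature "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // In-memory stand-in for urlConn.getInputStream()
        byte[] sample = "%PDF-1.3\n...".getBytes(StandardCharsets.US_ASCII);
        byte[] data = readFully(new ByteArrayInputStream(sample));
        System.out.println(looksLikePdf(data));
    }
}
```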

Speeding up HTML extraction

I am using Java to get a chunk of HTML from a web page. Right now I am using a URLConnection with getInputStream(), which loads the whole page and takes a little longer than I would like. Is there any way to load just the chunk I need, or to exclude images or anything else, to speed it up? Any help is appreciated. Thank you.
Here is some code:
URL page = new URL("http://www.stackoverflow.com");
URLConnection connection = page.openConnection();
String html = getResponseData(connection);
public static String getResponseData(URLConnection connection) throws IOException {
    StringBuilder sb = new StringBuilder();
    InputStream is = connection.getInputStream();
    int count;
    // Reads one byte at a time from the unbuffered stream
    while ((count = is.read()) != -1) {
        sb.append((char) count);
    }
    return sb.toString();
}

I think you could look for the data you need inside that while loop, and abort as soon as you have found it.
Side note: your code only loads the HTML, not the actual images. They are not part of the response you get when requesting the page.
UPDATE: You could also buffer your input stream, which could make reading faster. You can do this as follows:
InputStream is = new BufferedInputStream(connection.getInputStream());
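
The two suggestions above (buffer the input and stop early) can be sketched together as follows. The class name, the method name, and the idea of stopping at a marker string are assumptions for illustration, and UTF-8 is assumed as the page encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: the name EarlyStopReader is only for illustration.
final class EarlyStopReader {

    // Reads line by line and returns everything up to and including the first
    // line that contains the marker. If the marker never appears, the whole
    // input is returned. Each line is terminated with '\n' in the result.
    static String readUntil(InputStream in, String marker) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
                if (line.contains(marker)) {
                    break; // stop reading; the rest of the page is not needed
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In-memory stand-in for connection.getInputStream()
        String page = "<html>\n<title>Hi</title>\n<body>lots of content</body>\n";
        InputStream in = new ByteArrayInputStream(page.getBytes(StandardCharsets.UTF_8));
        System.out.print(readUntil(in, "</title>"));
    }
}
```

How much time this saves depends on where the chunk sits in the page and on how the server sends the response, so it is worth measuring before relying on it.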

How to load a file across the network and handle it as a String

I would like to display the contents of a URL in a JTextArea. I have a URL that points to an XML file, and I just want to display the contents of that file in the JTextArea. How can I do this?

A better JComponent for HTML content would be JEditorPane or JTextPane; most websites should display correctly there, or you can create your own HTML content. Note that Java 6 only supports HTML up to version 3.2. There are lots of examples on this forum or here.

You can do it this way:
final URL myUrl = new URL("http://www.example.com/file.xml");
final InputStream in = myUrl.openStream();
final StringBuilder out = new StringBuilder();
final byte[] buffer = new byte[1024]; // any reasonable buffer size works
try {
    for (int ctr; (ctr = in.read(buffer)) != -1;) {
        // Decodes with the platform default charset
        out.append(new String(buffer, 0, ctr));
    }
} catch (IOException e) {
    // you may want to handle the Exception. Here this is just an example:
    throw new RuntimeException("Cannot convert stream to string", e);
}
final String yourFileAsAString = out.toString();
The content of your file is then stored in the String called yourFileAsAString.
You can insert it into your JTextArea using JTextArea.insert(yourFileAsAString, pos), or append it using JTextArea.append(yourFileAsAString).
In the latter case, you can append the text directly to the JTextArea as it is read, instead of using a StringBuilder. To do so, remove the StringBuilder from the code above and modify the for() loop as follows:
for (int ctr; (ctr = in.read(buffer)) != -1;) {
    yourJTextArea.append(new String(buffer, 0, ctr));
}

Assuming it is an HTTP URL, open an HttpURLConnection and read the content from it.

Using java.net.URL, open the resource as a stream (method openStream()).
Load the entire content as a String.
Place it in your text area.

UTF-8 Encoding in java, retrieving data from website

I'm trying to get data from a website which is encoded in UTF-8 and insert it into a MySQL database. The database is also encoded in UTF-8.
This is the method I use to download data from a specific site.
public String download(String url) throws java.io.IOException {
java.io.InputStream s = null;
java.io.InputStreamReader r = null;
StringBuilder content = new StringBuilder();
try {
s = (java.io.InputStream)new URL(url).getContent();
r = new java.io.InputStreamReader(s, "UTF-8");
char[] buffer = new char[4*1024];
int n = 0;
while (n >= 0) {
n = r.read(buffer, 0, buffer.length);
if (n > 0) {
content.append(buffer, 0, n);
}
}
}
finally {
if (r != null) r.close();
if (s != null) s.close();
}
return content.toString();
}
If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
All my websites are encoded in UTF-8.
Please help.
If encoding is set to 'windows-1252' (r = new java.io.InputStreamReader(s, "windows-1252"); ) everything works fine and I get Côte d'Ivoire on my website, but in Java this title looks like 'C?´te d'Ivoire', which breaks other things such as links. What does this mean?

I would consider using commons-io; it has a function that does what you want (IOUtils.toString).
That is, replace your code with this:
// Needs org.apache.commons.io.IOUtils and java.net.URL
public String download(String url) throws java.io.IOException {
    java.io.InputStream s = null;
    String content = null;
    try {
        s = (java.io.InputStream) new URL(url).getContent();
        content = IOUtils.toString(s, "UTF-8");
    }
    finally {
        if (s != null) s.close();
    }
    return content;
}
If that doesn't do it, check whether you can store the text to a file correctly, to rule out the possibility that your database isn't set up correctly.

Java
The problem seems to lie in the HttpServletResponse, if you have a servlet or JSP page. Make sure to set your HttpServletResponse encoding to UTF-8.
In a JSP page, or in the doGet or doPost of a servlet, before any content is sent to the response, just do:
response.setCharacterEncoding("UTF-8");
PHP
In PHP, try using the utf8_encode function after retrieving the data from the database.

Is your database encoding set to UTF-8 for the server, the client, and the connection, and have the tables been created with that encoding? Check 'show variables' and 'show create table <one-of-the-tables>'.

If encoding is set to 'UTF-8' (r = new java.io.InputStreamReader(s, "UTF-8"); ) data inserted into database seems to look OK, but when I try to display it, I am getting something like this: C�te d'Ivoire, instead of Côte d'Ivoire.
Thus, the encoding used during display is wrong. How are you displaying it? As per the comments, it's a PHP page? If so, then you need to take two things into account:
1. Write the text to the HTTP response output using the same encoding, thus UTF-8.
2. Set the content type charset to UTF-8 so that the web browser knows which encoding to use to display the text.
As per the comments, you have apparently already done 2. That leaves 1: in PHP you need to install mbstring and set mbstring.http_output to UTF-8 as well. I have found this cheatsheet very useful.
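
To illustrate what a mismatch between writer and reader looks like, here is a small sketch in Java. The class and method names are hypothetical, and ISO-8859-1 stands in for windows-1252 because it is guaranteed to be available and maps the bytes involved here to the same characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the name MojibakeDemo is only for illustration.
final class MojibakeDemo {

    // Text written as UTF-8 but read back as a single-byte charset:
    // each multi-byte character turns into two or more wrong characters.
    static String utf8ReadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    // Text written in a single-byte charset but read back as UTF-8:
    // bytes that are not valid UTF-8 become the replacement character U+FFFD.
    static String latin1ReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "C\u00f4te d'Ivoire"; // \u00f4 is o with circumflex
        System.out.println(utf8ReadAsLatin1(original));
        System.out.println(latin1ReadAsUtf8(original));
    }
}
```

The second case matches the C�te d'Ivoire symptom from the question: somewhere between the database and the browser, bytes in a single-byte encoding are being interpreted as UTF-8.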
