Download the entire webpage - java

A static webpage can be downloaded with HTMLEditorKit. However, I need to download a page that only loads its full content as the user scrolls. This behaviour is most commonly implemented with JavaScript and Ajax.
Q.: Is there a way to trick the target webpage, using only Java code, in order to download its full content?
Q.2: If this is not possible with Java alone, is it possible in combination with JavaScript?
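For reference, this is what I have written so far:
import java.awt.image.BufferedImage;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import javax.imageio.ImageIO;
import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class PageDownload {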
public static void main(String[] args) throws Exception {
String webUrl = "...";
URL url = new URL(webUrl);
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
HTMLEditorKit.Parser parser = new ParserDelegator();
HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
parser.parse(br, callback, true);
for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.IMG);
iterator.isValid(); iterator.next()) {
AttributeSet attributes = iterator.getAttributes();
String imgSrc = (String) attributes.getAttribute(HTML.Attribute.SRC);
if (imgSrc != null && (imgSrc.endsWith(".jpg") || (imgSrc.endsWith(".jpeg"))
|| (imgSrc.endsWith(".png")) || (imgSrc.endsWith(".ico"))
|| (imgSrc.endsWith(".bmp")))) {
try {
downloadImage(webUrl, imgSrc);
} catch (IOException ex) {
System.out.println(ex.getMessage());
}
}
}
}
private static void downloadImage(String url, String imgSrc) throws IOException {
BufferedImage image = null;
try {
// Resolve the image path against the page URL (an absolute imgSrc is returned unchanged)
url = new URL(new URL(url), imgSrc).toString();
imgSrc = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
String imageFormat = imgSrc.substring(imgSrc.lastIndexOf(".") + 1);
String imgPath = "..." + imgSrc;
URL imageUrl = new URL(url);
image = ImageIO.read(imageUrl);
if (image != null) {
File file = new File(imgPath);
ImageIO.write(image, imageFormat, file);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}

Use the HtmlUnit library to get all the text, images, and CSS files.
HtmlUnit: htmlunit.sourceforge.net
1) To download text content, see the following questions:
Whole page text: "How to get a HTML page using HtmlUnit"
A specific tag such as span: "how to get text between a specific span with HtmlUnit"
2) To download images and other files, see: "How can I tell HtmlUnit's WebClient to download images and css?"

Yes, you can download a webpage to your local machine with Java code. You cannot download static HTML content with JavaScript, because JavaScript does not let you create files the way Java does.
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
public class HttpDownloadUtility {
private static final int BUFFER_SIZE = 4096;
/**
* Downloads a file from a URL
* @param fileURL HTTP URL of the file to be downloaded
* @param saveDir path of the directory to save the file
* @throws IOException
*/
public static void downloadFile(String fileURL, String saveDir)
throws IOException {
URL url = new URL(fileURL);
HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
int responseCode = httpConn.getResponseCode();
// always check HTTP response code first
if (responseCode == HttpURLConnection.HTTP_OK) {
String fileName = "";
String disposition = httpConn.getHeaderField("Content-Disposition");
String contentType = httpConn.getContentType();
int contentLength = httpConn.getContentLength();
if (disposition != null) {
// extracts file name from header field
int index = disposition.indexOf("filename=");
if (index > 0) {
fileName = disposition.substring(index + 10,
disposition.length() - 1);
}
} else {
// extracts file name from URL
fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1,
fileURL.length());
}
System.out.println("Content-Type = " + contentType);
System.out.println("Content-Disposition = " + disposition);
System.out.println("Content-Length = " + contentLength);
System.out.println("fileName = " + fileName);
// opens input stream from the HTTP connection
InputStream inputStream = httpConn.getInputStream();
String saveFilePath = saveDir + File.separator + fileName;
// opens an output stream to save into file
FileOutputStream outputStream = new FileOutputStream(saveFilePath);
int bytesRead = -1;
byte[] buffer = new byte[BUFFER_SIZE];
while ((bytesRead = inputStream.read(buffer)) != -1) {
outputStream.write(buffer, 0, bytesRead);
}
outputStream.close();
inputStream.close();
System.out.println("File downloaded");
} else {
System.out.println("No file to download. Server replied HTTP code: " + responseCode);
}
httpConn.disconnect();
}
}
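The `Content-Disposition` handling above assumes that the file name is wrapped in double quotes and is the last parameter of the header. A slightly more tolerant variant might look like the sketch below. The class and method names (`DispositionNames`, `fileNameFrom`) are hypothetical, and the sketch deliberately ignores the extended `filename*=` form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DispositionNames {
    // Matches filename="quoted value" or filename=token (up to the next ';' or whitespace).
    // Assumption: the extended filename*= form is not handled and simply does not match.
    private static final Pattern FILENAME =
            Pattern.compile("(?i)\\bfilename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))");

    // Returns the file name found in the header value, or the fallback if there is none.
    static String fileNameFrom(String disposition, String fallback) {
        if (disposition == null) {
            return fallback;
        }
        Matcher m = FILENAME.matcher(disposition);
        if (!m.find()) {
            return fallback;
        }
        String name = m.group(1) != null ? m.group(1) : m.group(2);
        return name.isEmpty() ? fallback : name;
    }

    public static void main(String[] args) {
        System.out.println(fileNameFrom("attachment; filename=\"report.pdf\"", "download.bin"));
        System.out.println(fileNameFrom("attachment; filename=report.pdf; size=10", "download.bin"));
        System.out.println(fileNameFrom(null, "download.bin"));
    }
}
```

In the download method, the fallback argument could be the name taken from the last segment of the URL, as the original code already does.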

You can achieve this with the Selenium WebDriver Java classes:
https://code.google.com/p/selenium/wiki/GettingStarted
WebDriver is generally used for testing, but it can emulate a user scrolling down the page until the page stops changing; you can then use Java code to save the content to a file.

You can do it using IDM's grabber.
This should help:
https://www.internetdownloadmanager.com/support/idm-grabber/grabber_wizard.html

Related

how to read a pdf file online and save on local machine using java

Hi, I am trying to read a PDF file from a URL and save it locally, but when I open the saved document I get an error saying that the content is not supported.
URL url1 =
new URL("http://www.gnostice.com/downloads/Gnostice_PathQuest.pdf");
byte[] ba1 = new byte[1024];
int baLength;
FileOutputStream fos1 = new FileOutputStream("/mnt/linuxabc/research_paper/Gnostice_PathQuest.pdf");
try {
URLConnection urlConn = url1.openConnection();
/* if (!urlConn.getContentType().equalsIgnoreCase("application/pdf")) {
System.out.println("FAILED.\n[Sorry. This is not a PDF.]");
} else {*/
try {
InputStream is1 = url1.openStream();
while ((baLength = is1.read(ba1)) != -1) {
fos1.write(ba1, 0, baLength);
}
fos1.flush();
fos1.close();
is1.close();
} catch (ConnectException ce) {
System.out.println("FAILED.\n[" + ce.getMessage() + "]\n");
}
// }
Your PDF link actually redirects to https://www.gnostice.com/downloads.asp, so there is no PDF directly behind the link.
Try another link: first check in a browser of your choice that opening the PDF's URL renders a real PDF.
The code below is practically the same as yours except for the PDF's URL and the output path; I also added a throws clause to the main method's signature and print the content type.
It works as expected:
public class PdfFileReader {
public static void main(String[] args) throws IOException {
URL pdfUrl = new URL("http://www.crdp-strasbourg.fr/je_lis_libre/livres/Anonyme_LesMilleEtUneNuits1.pdf");
byte[] ba1 = new byte[1024];
int baLength;
try (FileOutputStream fos1 = new FileOutputStream("c:\\mybook.pdf")) {
URLConnection urlConn = pdfUrl.openConnection();
System.out.println("The content type is: " + urlConn.getContentType());
try {
InputStream is1 = pdfUrl.openStream();
while ((baLength = is1.read(ba1)) != -1) {
fos1.write(ba1, 0, baLength);
}
fos1.flush();
fos1.close();
is1.close();
} catch (ConnectException ce) {
System.out.println("FAILED.\n[" + ce.getMessage() + "]\n");
}
}
}
}
Output:
The content type is: application/pdf
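A redirect to an HTML page, like the one described in this answer, can also be detected after the download, because a real PDF file begins with the bytes `%PDF-`. The sketch below checks only that signature. `PdfSignature` and `looksLikePdf` are hypothetical names, and passing the check does not prove that the rest of the file is valid.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PdfSignature {
    // Every PDF file starts with this ASCII marker, followed by the version number.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given bytes (the start of a downloaded file) begin with the marker.
    static boolean looksLikePdf(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(firstBytes, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<!DOCTYPE html><html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdf));
        System.out.println(looksLikePdf(html));
    }
}
```

In the download loop, the first buffer read from the stream could be passed to `looksLikePdf` before anything is written to disk.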
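// Note: a Reader decodes the bytes as text and readLine() drops line breaks,
// so the returned String cannot be written back out as a valid PDF.
private static String readPdf() throws MalformedURLException, IOException {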
URL url = new URL("https://colaboracion.dnp.gov.co/CDT/Sinergia/Documentos/Informe%20al%20Congreso%20Presidencia%202017_Baja_f.pdf");
BufferedReader read = new BufferedReader(
new InputStreamReader(url.openStream()));
String i;
StringBuilder stringBuilder = new StringBuilder();
while ((i = read.readLine()) != null) {
stringBuilder.append(i);
}
read.close();
return stringBuilder.toString();
}

How to convert HTML String to PDF using ConvertAPI (without a physical file)

Previously I was using http://do.convertapi.com/Web2Pdf to convert an HTML string to PDF with a simple GET request (not POST) from Java. The entire content was passed in the curl parameter.
However, that API seems to have stopped working recently. I'm trying to port over to https://v2.convertapi.com/web/to/pdf but I cannot find a sample to do the same using the new API either with GET or POST.
Can someone provide an example to make a GET or POST request using Java?
UPDATE: I have managed to make it work.
private static final String WEB2PDF_API_URL = "https://v2.convertapi.com/html/to/pdf";
private static final String WEB2PDF_SECRET = "secret-here";
String htmlContent = "valid HTML content here";
URL apiUrl = new URL(WEB2PDF_API_URL + "?secret=" + WEB2PDF_SECRET + "&download=attachment&PageOrientation=landscape&MarginLeft=0&MarginRight=0&MarginTop=0&MarginBottom=0");
HttpURLConnection connection = null;
ByteArrayOutputStream buffer = null;
connection = (HttpURLConnection) apiUrl.openConnection();
connection.setRequestProperty("Content-Disposition", "attachment; filename=\"data.html\"");
connection.setRequestProperty("Content-Type", "application/octet-stream");
connection.setRequestMethod("POST");
connection.setConnectTimeout(60000);
connection.setReadTimeout(60000);
connection.setDoOutput(true);
/* write request */
OutputStreamWriter writer = new OutputStreamWriter(connection.getOutputStream());
writer.write(htmlContent);
writer.flush();
writer.close();
/* read response */
String responseMessage = connection.getResponseMessage();
logger.info("responseMessage: " + responseMessage);
int statusCode = connection.getResponseCode();
logger.info("statusCode: " + statusCode);
if (statusCode == HttpURLConnection.HTTP_OK) {
logger.info("HTTP status code OK");
// parse output
InputStream is = connection.getInputStream();
buffer = new ByteArrayOutputStream();
int nRead;
byte[] data = new byte[16384];
while ((nRead = is.read(data, 0, data.length)) != -1) {
buffer.write(data, 0, nRead);
}
buffer.flush();
byte[] attachmentData = buffer.toByteArray();
Multipart content = new MimeMultipart();
...
MimeBodyPart attachment = new MimeBodyPart();
InputStream attachmentDataStream = new ByteArrayInputStream(attachmentData);
attachment.setFileName("filename-" + Long.toHexString(Double.doubleToLongBits(Math.random())) + ".pdf");
attachment.setContent(attachmentDataStream, "application/pdf");
content.addBodyPart(attachment);
...
}
You can easily push an HTML string as a file. I do not have a Java example, but this C# demo should point you in the right direction.
using System;
using System.IO;
using System.Net.Http;
using System.Text;
class MainClass {
public static void Main (string[] args) {
var url = new Uri("https://v2.convertapi.com/html/to/pdf?download=attachment&secret=<YourSecret>");
var htmlString = "<!doctype html><html lang=en><head><meta charset=utf-8><title>ConvertAPI test</title></head><body>This page is generated from HTML string.</body></html>";
var content = new StringContent(htmlString, Encoding.UTF8, "application/octet-stream");
content.Headers.Add("Content-Disposition", "attachment; filename=\"data.html\"");
using (var resultFile = File.OpenWrite(@"C:\Path\to\result\file.pdf"))
{
new HttpClient().PostAsync(url, content).Result.Content.CopyToAsync(resultFile).Wait();
}
}
}

Downloaded files with whitespaces in path are damaged

I have the following problem. I have to download PDF files from a server, and some of them have whitespace in their names. Every file gets downloaded, but those with whitespace in the name cannot be opened.
If I access these files on the server via Chrome, they open fine (even with the whitespace in the URL).
What puzzles me is that Java reports the files as downloaded, but when I try to open them in Acrobat Reader, it shows an error message saying the files are damaged. Here is a sample of my code:
public static void downloadFile(String fileURL, String saveDir) throws IOException {
Authenticator.setDefault(new Authenticator() {
@Override
protected PasswordAuthentication getPasswordAuthentication() {
return new PasswordAuthentication("*****", "*********".toCharArray());
}
});
final int BUFFER_SIZE = 4096;
URL url = new URL(fileURL);
HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
String credentials = "ptt" + ":" + "ptt123";
String encoding = Base64.getEncoder().encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
httpConn.setRequestProperty("Authorization", String.format("Basic %s", encoding));
int responseCode = 0;
responseCode = httpConn.getResponseCode();
// always check HTTP response code first
if (responseCode == HttpURLConnection.HTTP_OK) {
String fileName = "";
String disposition = httpConn.getHeaderField("Content-Disposition");
String contentType = httpConn.getContentType();
int contentLength = httpConn.getContentLength();
if (disposition != null) {
// extracts file name from header field
int index = disposition.indexOf("filename=");
if (index > 0) {
fileName = disposition.substring(index + 10,
disposition.length() - 1);
}
} else {
// extracts file name from URL
fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1,
fileURL.length());
}
// opens input stream from the HTTP connection
InputStream inputStream = httpConn.getInputStream();
String saveFilePath = saveDir + File.separator + fileName;
// opens an output stream to save into file
FileOutputStream outputStream = new FileOutputStream(saveFilePath);
int bytesRead = -1;
byte[] buffer = new byte[BUFFER_SIZE];
while ((bytesRead = inputStream.read(buffer)) != -1) {
outputStream.write(buffer, 0, bytesRead);
}
outputStream.close();
inputStream.close();
System.out.println("File downloaded");
} else {
System.out.println("No file to download. Server replied HTTP code: " + responseCode);
}
httpConn.disconnect();
}
I also tried replacing the whitespace with "%20" in the fileURL.
So what can the problem be? As I wrote above, the files without any whitespace can be opened after the download without any problems.
I use Java 1.7.
Cheers,
Andrej
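If fileName contains a space, replace it with some other character. It may work; if not, please let me know.
if (fileName.contains(" ")) {
fileName = fileName.replace(" ", "_"); // keep the returned value
}
// Encode only the last path segment; URLEncoder writes spaces as "+", so convert them to "%20"
int slash = fileURL.lastIndexOf('/') + 1;
URL url = new URL(fileURL.substring(0, slash) + URLEncoder.encode(fileURL.substring(slash), "UTF-8").replace("+", "%20"));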
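Another way to escape whitespace in the path is to let `java.net.URI` do the quoting: its multi-argument constructors percent-encode characters that are illegal in a path, such as spaces. The sketch below assumes the URL is already available as separate scheme, host, and path parts; `PathEncoding` and `toEncodedUrl` are hypothetical names, and `example.com` is a placeholder host.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class PathEncoding {
    // Builds a URL string in which illegal path characters (such as spaces) are percent-encoded.
    // Caveat: these constructors always quote '%', so a path that already contains "%20"
    // would be encoded twice; pass the raw, unencoded path.
    static String toEncodedUrl(String scheme, String host, String rawPath) {
        try {
            return new URI(scheme, host, rawPath, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URL from the given parts", e);
        }
    }

    public static void main(String[] args) {
        // prints http://example.com/docs/my%20file.pdf
        System.out.println(toEncodedUrl("http", "example.com", "/docs/my file.pdf"));
    }
}
```

The resulting string can then be passed to `new URL(...)` in `downloadFile`.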

Android, Download a .txt file and save internally

As the title states, I am simply trying to download a test.txt file from the following URL and save it internally, ideally within drawable.
I have been trying to modify this code to work, but with little success: I keep getting "unable to download null" errors.
int count;
try {
URL url = new URL("https://www.darkliteempire.gaming.multiplay.co.uk/testdownload.txt");
URLConnection conexion = url.openConnection();
conexion.connect();
int lenghtOfFile = conexion.getContentLength();
InputStream is = url.openStream();
File testDirectory = new File(Environment.getExternalStorageDirectory() + "/Download");
if (!testDirectory.exists()) {
testDirectory.mkdir();
}
FileOutputStream fos = new FileOutputStream(testDirectory + "/test.txt");
byte data[] = new byte[1024];
long total = 0;
int progress = 0;
while ((count = is.read(data)) != -1) {
total += count;
int progress_temp = (int) (total * 100 / lenghtOfFile); // cast the result, not just total
fos.write(data, 0, count);
}
is.close();
fos.close();
} catch (Exception e) {
Log.e("ERROR DOWNLOADING", "Unable to download" + e.getMessage());
}
There must be a simpler way to do this?
The file itself is tiny, with perhaps 3 or 4 lines of text, so I don't need anything fancy.
Please update the line below so that it contains a valid URL.
URL url = new URL("https://www.http://darkliteempire.gaming.multiplay.co.uk/testdownload.txt");
With a valid URL, the line looks like this:
URL url = new URL("http://www.darkliteempire.gaming.multiplay.co.uk/testdownload.txt");
This will solve your problem.
Using the AQuery library you get something pretty straightforward. Plus you'll get heaps of other cool functions to shorten your code.
http://code.google.com/p/android-query/wiki/AsyncAPI
String url = "https://picasaweb.google.com/data/feed/base/featured?max-results=16";
File ext = Environment.getExternalStorageDirectory();
File target = new File(ext, "aquery/myfolder/photos.xml");
aq.progress(R.id.progress).download(url, target, new AjaxCallback<File>(){
public void callback(String url, File file, AjaxStatus status) {
if(file != null){
showResult("File:" + file.length() + ":" + file, status);
}else{
showResult("Failed", status);
}
}
});
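For a file this small, the standard library alone may be enough: `Files.copy` can write an `InputStream` straight to a file without a manual read loop. This is a minimal sketch with hypothetical names (`SmallDownload`, `save`). It assumes `java.nio.file` is available, which on Android, as far as I know, requires API level 26 or later, and it uses an in-memory stream in place of `url.openStream()` so that it runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SmallDownload {
    // Copies the whole stream into the target file, replacing it if it already exists,
    // closes the stream, and returns the number of bytes written.
    static long save(InputStream in, Path target) {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for url.openStream(): a few lines of text held in memory.
        byte[] text = "line 1\nline 2\nline 3\n".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("test", ".txt");
        long written = save(new ByteArrayInputStream(text), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

In the real case, the stream passed to `save` would come from the URL, and the call would need to run off the main thread.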

Facebook photo upload using HTTP request in Java

I'm trying to upload a photo to my profile via HttpURLConnection.
I have the access token for user_status, user_photos, offline_access, publish_stream.
Samples from my code:
url = new URL("https://graph.facebook.com/me/photos");
String content =
"access_token=" + URLEncoder.encode ("my_token") +
"&message=" + URLEncoder.encode ("SUNT !!!")+
"&url=" + URLEncoder.encode("file:///D:\\personale\\Images\\P0030_07-02-11_00.JPG");
When I make the request, I get the following error:
{"error":{"message":"file:\/\/\/D:\\personale\\Images\\P0030_07-02-11_00.JPG is an internal url, but this is an external request.","type":"CurlUrlInvalidException"}}
Can I upload files from my PC using a file URL?
How can I upload a file as a byte array?
Many thanks!
The solution, as posted by dnp:
public class Main2 {
static final String BOUNDARY = "----------V2ymHFg03ehbqgZCaKO6jy";
public static void main(String [] args) throws IOException{
URL url;
HttpURLConnection urlConn;
DataOutputStream printout;
DataInputStream input;
//-------------------------------------------------------------------
File image = new File("D:/personale/Images/P1025[01]_03-07-11.JPG");
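byte [] buffer = new byte[(int) image.length()];
// readFully keeps reading until the buffer is full; a single read() may return fewer bytes
try (DataInputStream imgStream = new DataInputStream(new FileInputStream(image))) {
imgStream.readFully(buffer);
}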
//-------------------------------------------------------------------
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("10.18.1.1", 4444));
//url = new URL ("https://graph.facebook.com/me/feed");
url = new URL("https://graph.facebook.com/me/photos?access_token=AAACkMOZA41QEBACsafBxqVfXX54JqGLQSaE6YQ062NuTe3XUZBTdTEvy3R2H9Yr4PZA9r38JvLni7r1hYLuZCnBZAAPPH3krMMSKtIraiswZCiIZBu0nyYT");
System.out.println("Before Open Connection");
urlConn = (HttpURLConnection) url.openConnection(proxy);
urlConn.setRequestProperty("Content-Type", "multipart/form-data; boundary=" + getBoundaryString());
urlConn.setDoOutput (true);
urlConn.setUseCaches (false);
urlConn.setRequestMethod("POST");
// String content = "access_token=" + URLEncoder.encode ("AAACkMOZA41QEBAHQHUyYcMsLAewOYIe1j5dlOVOlMZBm6h9rvCQEFhmcBHg7ETHrdlrgv4sau573xMVuxIt8DzRxKFuqRqqBskDvOZA9iIkZCdPyI4Bu");
String boundary = getBoundaryString();
String boundaryMessage = getBoundaryMessage(boundary, "upload_field", "P1025[01]_03-07-11.JPG", "image/jpeg");
String endBoundary = "\r\n--" + boundary + "--\r\n";
ByteArrayOutputStream bos = new ByteArrayOutputStream();
bos.write(boundaryMessage.getBytes());
bos.write(buffer);
bos.write(endBoundary.getBytes());
printout = new DataOutputStream (urlConn.getOutputStream ());
//printout.writeBytes(content);
printout.write(bos.toByteArray());
printout.flush ();
printout.close ();
// Get response data.
//input = new DataInputStream (urlConn.getInputStream ());
if (urlConn.getResponseCode() == 400 || urlConn.getResponseCode() == 500) {
input = new DataInputStream (urlConn.getErrorStream());
} else {
input = new DataInputStream (urlConn.getInputStream());
}
String str;
while (null != ((str = input.readLine())))
{
System.out.println (str);
}
input.close ();
}
public static String getBoundaryString()
{
return BOUNDARY;
}
public static String getBoundaryMessage(String boundary, String fileField, String fileName, String fileType)
{
StringBuffer res = new StringBuffer("--").append(boundary).append("\r\n");
res.append("Content-Disposition: form-data; name=\"").append(fileField).append("\"; filename=\"").append(fileName).append("\"\r\n")
.append("Content-Type: ").append(fileType).append("\r\n\r\n");
return res.toString();
}
}
You can use the following API to upload an image to the default album on Facebook:
http://code.google.com/p/socialauth/
The latest release, version 3.0, has this feature.
