Fetch AJAX/JavaScript content using HtmlUnit - Java

I have written code that fetches the HTML content of a page as a response, using HtmlUnit. But I am getting errors for some specific URLs, like
https://communities.netapp.com/welcome
For the first page I am able to retrieve the contents, but I don't get the content that is loaded by the "load more" button.
Here's my code:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class Sample {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
        String url = "https://communities.netapp.com/welcome";
        WebClient client = new WebClient(BrowserVersion.INTERNET_EXPLORER_9);
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setRedirectEnabled(true);
        client.getOptions().setThrowExceptionOnScriptError(true);
        client.getOptions().setCssEnabled(true);
        client.getOptions().setUseInsecureSSL(true);
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        client.setAjaxController(new NicelyResynchronizingAjaxController());
        HtmlPage page = client.getPage(url);
        Writer output = null;
        String text = page.asText();
        File file = new File("D://write6.txt");
        output = new BufferedWriter(new FileWriter(file));
        output.write(text);
        output.close();
        System.out.println("Your file has been written");
        // System.out.println("as Text ==" + page.asText());
        // System.out.println("asXML == " + page.asXml());
        // System.out.println("text content ==" + page.getTextContent());
        // System.out.println(page.getWebResponse().getContentAsString());
    }
}
Any suggestions?

As I understand from your question, you have a button which you have to press.
Please look at: http://htmlunit.sourceforge.net/gettingStarted.html
There is an example there of submitting a form.
This should be very similar here.
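For example, a minimal sketch of clicking a "load more" style element with HtmlUnit and then waiting for the background AJAX calls to finish (the element ID here is hypothetical; inspect the page to find the real one):
import com.gargoylesoftware.htmlunit.html.HtmlElement;
// Sketch: click a hypothetical "load more" element, then let pending AJAX finish.
HtmlPage page = client.getPage(url);
HtmlElement loadMore = page.getHtmlElementById("loadMoreButton"); // assumed ID, check the page source
page = loadMore.click();
client.waitForBackgroundJavaScript(10000); // wait up to 10s for AJAX triggered by the click
System.out.println(page.asText()); // should now include the extra content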

Related

Read PDF from a URL using Selenium-WebDriver and PDF-Box

I'm trying to read the text from a PDF using Selenium WebDriver and the PDFBox API. If possible I don't want to download the file, but only read the PDF from the web, getting just the text of the PDF into a string. The code I'm using is below, but I can't make it work:
I've found examples of code that download the PDF and compare the downloaded file, but no working example that extracts the text of the PDF from the URL.
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import javax.swing.JDialog;
import javax.swing.JOptionPane;
import javax.swing.Timer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
public class PDFextract {

    public static void main(String[] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("THE URL OF SITE I CAN'T SHARE");
        System.out.println(driver.getTitle());
        List<WebElement> links = driver.findElements(By.xpath("//a[@title='Click to open file']"));
        String fLinks = "";
        for (WebElement link : links) {
            fLinks = fLinks + link.getAttribute("href"); // only yields a valid URL if there is exactly one link
        }
        fLinks = fLinks.trim();
        System.out.println(fLinks); // till here the code works fine.. I get a valid URL link
        // the code below doesn't work
        URL url = new URL(fLinks);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        InputStream is = connection.getInputStream();
        PDDocument pdd = PDDocument.load(is);
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(pdd);
        pdd.close();
        is.close();
        System.out.println(text);
    }
}
I get the error:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: ***AS TOLD ABOVE, I CANT SHARE THE URL***
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at PDFextract.main(PDFextract.java:106)
Edited on 07.05.2020:
@TilmanHausherr, I've done more research; this helped with the first part, how to read a PDF from a link: Selenium Tutorial: Read PDF Content using Selenium WebDriver
This method works:
String pdfContent = readPDFContent(driver.getCurrentUrl());
public String readPDFContent(String appUrl) throws Exception {
    URL url = new URL(appUrl);
    InputStream is = url.openStream();
    BufferedInputStream fileToParse = new BufferedInputStream(is);
    PDDocument document = null;
    String output = null;
    try {
        document = PDDocument.load(fileToParse);
        output = new PDFTextStripper().getText(document);
        System.out.println(output);
    } finally {
        if (document != null) {
            document.close();
        }
        fileToParse.close();
        is.close();
    }
    return output;
}
It seems my problem is the link itself: the HTML element is <embed>, and in my case there is also a stream-url:
<embed id="plugin" type="application/x-google-chrome-pdf"
src="https://"SITE
I CAN'T TELL"/file.do? _tr=4d51599fead209bc4ef42c6e5c4839c9bebc2fc46addb11a"
stream-URL="chrome-extension://mhjfbmdgcfjojefgiehjai/6958a80-4342-43fc-
838a-1dbd07fa2fc1" headers="accept-ranges: bytes
content-disposition: inline;filename="online.pdf"
content-length: 71488
content-security-policy: frame-ancestors 'self' https://*"SITE I CAN'T TELL"
https://*"DOMAIN I CAN'T TELL".net
content-type: application/pdf
I found these:
1. Download the File which has stream-url is the chrome extension in the embed tag using selenium
2. Handling contents of Embed tag in selenium python
But I still didn't manage to read the PDF with PDFBox, because the element is <embed> and I might have to access the stream-url. A sketch of the approach I have in mind is below.
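For illustration, a minimal sketch (assuming the embed's src attribute serves the raw PDF directly; the element id "plugin" is taken from the snippet above):
// Sketch: fetch the PDF through the <embed> element's src attribute.
WebElement embed = driver.findElement(By.id("plugin"));
String pdfUrl = embed.getAttribute("src");
try (InputStream in = new URL(pdfUrl).openStream();
        PDDocument doc = PDDocument.load(in)) {
    System.out.println(new PDFTextStripper().getText(doc));
}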

How to send an image with JDA

I'm developing my Java bot for Discord, and I want to send an image. I tried using TextChannel.sendFile(File, Message), but it's not the result that I want to get. I want this file to be displayed like a normal image.
The imports:
import java.io.File;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import javax.xml.namespace.QName;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.apache.commons.io.FileUtils;
import net.dv8tion.jda.core.MessageBuilder;
import net.dv8tion.jda.core.entities.Message;
import net.dv8tion.jda.core.entities.TextChannel;
import net.dv8tion.jda.core.events.message.MessageReceivedEvent;
And the other code:
// s (the request URL) and c (the target TextChannel) are defined elsewhere in the class
URL url = new URL(s.toString());
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(conn.getInputStream());
final List<String> files = new ArrayList<>();
while (reader.hasNext()) {
    XMLEvent e = reader.nextEvent();
    if (e.isStartElement()) {
        StartElement se = e.asStartElement();
        if (se.getName().getLocalPart().equals("post")) {
            Attribute purl = se.getAttributeByName(new QName("file_url"));
            files.add(purl.getValue());
        }
    }
}
int rid = ThreadLocalRandom.current().nextInt(files.size()); // nextInt's bound is exclusive, so no -1 here
String p = files.get(rid);
files.clear();
URL u = new URL(p);
final String[] dots = p.split("\\.");
final String format = dots[dots.length - 1];
File f = new File("its not a porn." + format);
FileUtils.copyURLToFile(u, f); // download from the picked file link, not the original XML URL
Message m = new MessageBuilder().append("okay :)").build();
c.sendFile(f, m).queue();
I tried to find a solution somewhere, but I haven't found any info that could help.
As of JDA 4.2.0_168:
The message on sendFile() is the name of the file that you are sending to the Discord servers, so it needs an extension.
Example:
File f = new File("image.png");
TextChannel.sendFile(f, "image.png").queue();
If you want comments in the message:
File f = new File("image.png");
// the name doesn't need to be the same, just the same extension
TextChannel.sendFile(f, "another_name.png").append("okay :)").queue();
Reading through the docs, you need to create a MessageEmbed and add it to the message using m.setEmbed(...).
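A minimal sketch of that approach, assuming JDA 4.x (where the imports live under net.dv8tion.jda.api; the attachment:// scheme lets the embed display the uploaded file inline):
// Sketch for JDA 4.x: upload the file and reference it from an embed.
// imports: net.dv8tion.jda.api.EmbedBuilder, net.dv8tion.jda.api.entities.TextChannel
EmbedBuilder eb = new EmbedBuilder();
eb.setImage("attachment://image.png"); // must match the attachment name used below
channel.sendFile(new File("image.png"), "image.png") // channel is your TextChannel
        .embed(eb.build())
        .queue();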

How to convert HTML to PDF with working hyperlinks using docx4j?

I am using Eclipse Luna 64-bit, Maven, the docx4j API for PDF conversion, and a template letter format on which I want to place my HTML code. This template is saved in my database.
I want to include a hyperlink in the PDF, so my users can click this link and open it in their browser.
This is my main class:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.faces.bean.ManagedBean;
import javax.faces.bean.ManagedProperty;
import javax.faces.bean.ViewScoped;
import javax.faces.model.SelectItem;
import org.apache.commons.lang.StringUtils;
import org.docx4j.Docx4J;
import org.docx4j.XmlUtils;
import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.relationships.Namespaces;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.Body;
import org.docx4j.wml.BooleanDefaultTrue;
import org.docx4j.wml.Document;
import org.docx4j.wml.P;
import org.docx4j.wml.PPrBase;
import org.docx4j.wml.R;
import org.docx4j.wml.Text;
import org.primefaces.context.RequestContext;
import org.primefaces.model.DefaultStreamedContent;
import org.primefaces.model.StreamedContent;
import org.primefaces.model.UploadedFile;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class LetterMaintenanceBean extends BaseManagedBean implements Serializable {

    public StreamedContent previewLetter() {
        String content = this.letter.getHtmlContent();
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>"; //Digvijay
        Pattern p = Pattern.compile(regex); //Digvijay
        System.out.println("p: " + p);
        Matcher m = p.matcher(content); //Digvijay
        System.out.println("m: " + m);
        content = m.replaceAll("<strong><u><span style=\"color:#0099cc\">$1</span></u></strong>"); //Digvijay
        System.out.println("regex1: " + regex); //Digvijay
        Map<String, String> previewExamples = this.getPreviewExamples(this.letter.getMessageTypeCode());
        for (Entry<String, String> example : previewExamples.entrySet()) {
            if (StringUtils.isNotBlank(example.getKey()) && StringUtils.isNotBlank(example.getValue())) {
                content = content.replace(example.getKey(), example.getValue());
                System.out.println("content after map date");
            }
        }
        System.out.println("content1:: " + content);
        if (!content.startsWith("<div>")) {
            content = "<div>" + content + "</div>";
        }
        // Docx4j does not understand HTML entities for special characters, so replace them with Unicode values.
        content = content.replace("&nbsp;", "\u00A0");
        content = content.replace("&rsquo;", "\u2019");
        content = content.replaceAll("</p>", "</p><br/>");
        content = content.replaceAll("\"</span>", "</span>");
        InputStream stream = null;
        try {
            System.out.println("content:" + content);
            if (this.letter.getHtmlContent().getBytes() != null && this.letter.getWfTemplateId() != null) {
                stream = new ByteArrayInputStream(this.HTMLToPDF(content.getBytes(), this.letter.getWfTemplateId()));
            } else {
                stream = new ByteArrayInputStream(this.HTMLToPDFWithoutTemplate(content.getBytes()));
            }
            StreamedContent file = new DefaultStreamedContent(stream, "application/pdf", this.letter.getLetterName() + ".pdf");
            return file;
        } catch (LetterMaintenanceException e) {
            this.processServiceException(e);
            StreamedContent file = new DefaultStreamedContent(
                    new ByteArrayInputStream(
                            "Unable to process your request. If the problem persists, please contact application support."
                                    .getBytes()), "application/pdf", "error" + ".pdf");
            return file;
        } catch (Exception e) {
            this.processGenericException(e);
            StreamedContent file = new DefaultStreamedContent(
                    new ByteArrayInputStream(
                            "Unable to process your request. If the problem persists, please contact application support."
                                    .getBytes()), "application/pdf", "error" + ".pdf");
            return file;
        }
    }
This is my HTMLToPDF() method:
    private byte[] HTMLToPDF(final byte[] htmlContent, final String templateId)
            throws Docx4JException, LetterMaintenanceException {
        LetterMaintenanceDelegate letterMaintenanceDelegate = new LetterMaintenanceDelegate();
        Template template = letterMaintenanceDelegate.retrieveTemplateById(templateId);
        if (template == null || template.getContent() == null) {
            throw new LetterMaintenanceException("Could not retrieve template");
        }
        InputStream is = new ByteArrayInputStream(template.getContent());
        WordprocessingMLPackage templatePackage = WordprocessingMLPackage.load(is);
        // Convert HTML to docx
        XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(templatePackage);
        XHTMLImporter.setHyperlinkStyle("Hyperlink");
        templatePackage.getMainDocumentPart().getContent()
                .addAll(XHTMLImporter.convert(new ByteArrayInputStream(htmlContent), null));
        // Add content of content docx to template
        templatePackage.getMainDocumentPart().getContent().addAll(templatePackage.getMainDocumentPart().getContent());
        // Handle page breaks
        templatePackage = this.handlePagebreaksInDocx(templatePackage);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Docx4J.toPDF(templatePackage, baos);
        return baos.toByteArray();
    }
}
In this code I am trying to convert HTML (with an href tag) to a PDF file, and in the PDF output the hyperlink must work.
The current output of this program is a PDF, but there are no working links in it.
How can I activate my links?
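One thing worth checking (an observation, not a confirmed fix): the regex in previewLetter() replaces each <a href="...">...</a> with a styled <span>, so by the time the XHTML importer runs there is no anchor element left for it to turn into a w:hyperlink. A minimal sketch of feeding the importer intact anchor markup, under that assumption:
// Sketch: the importer can only emit hyperlinks for <a> elements it actually sees.
String html = "<div><p>See <a href=\"https://example.com\">our site</a> for details.</p></div>";
XHTMLImporterImpl importer = new XHTMLImporterImpl(templatePackage);
importer.setHyperlinkStyle("Hyperlink"); // same style name as in HTMLToPDF()
templatePackage.getMainDocumentPart().getContent()
        .addAll(importer.convert(new ByteArrayInputStream(html.getBytes()), null));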

Jsoup reddit scraper 429 error

So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits, such as /r/wallpaper, I get a 429 error, and I'm wondering how to fix this. I totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyway:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class javascraper {

    public static void main(String[] args) throws MalformedURLException {
        Scanner scan = new Scanner(System.in);
        System.out.println("Where do you want to store the files?");
        String folderpath = scan.next();
        System.out.println("What subreddit do you want to scrape?");
        String subreddit = scan.next();
        new File(folderpath + "/" + subreddit).mkdir(); // create the folder before the name is turned into a URL
        subreddit = ("http://reddit.com/r/" + subreddit);
        try {
            // gets http protocol
            Document doc = Jsoup.connect(subreddit).timeout(0).get();
            // get page title
            String title = doc.title();
            System.out.println("title : " + title);
            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // get value from href attribute
                String checkLink = link.attr("href");
                if (imgCheck(checkLink)) { // checks to see if it is an image link
                    System.out.println("link : " + link.attr("href"));
                    downloadImages(checkLink, folderpath);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static boolean imgCheck(String http) {
        String png = ".png";
        String jpg = ".jpg";
        String jpeg = "jpeg"; // no period so the checker will only check the last four characters
        String gif = ".gif";
        return http.contains(png) || http.contains("gfycat") || http.contains(jpg)
                || http.contains(jpeg) || http.contains(gif);
    }

    private static void downloadImages(String src, String folderpath) throws IOException {
        // Extract the name of the image from the src attribute
        int indexname = src.lastIndexOf("/");
        if (indexname == src.length() - 1) {
            src = src.substring(0, indexname); // strip a trailing slash
        }
        indexname = src.lastIndexOf("/");
        String name = src.substring(indexname, src.length());
        System.out.println(name);
        // Open a URL stream and copy it byte by byte into the target file
        URL url = new URL(src);
        InputStream in = url.openStream();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(folderpath + name));
        for (int b; (b = in.read()) != -1;) {
            out.write(b);
        }
        out.close();
        in.close();
    }
}
Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests": you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (the format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). As it currently stands, your code runs too fast and doesn't specify one, so it will be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
try {
    Thread.sleep(2000); // delay to comply with rate limiting
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (i.e., /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
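For instance, a short sketch of pulling the JSON listing with jsoup (ignoreContentType is needed because jsoup otherwise refuses non-HTML responses; parsing the JSON itself is still up to you):
// Sketch: fetch the subreddit listing as JSON instead of scraping the HTML page.
Document doc = Jsoup.connect("https://www.reddit.com/r/wallpaper.json")
        .userAgent(USER_AGENT)
        .ignoreContentType(true)
        .timeout(0)
        .get();
String json = doc.body().text(); // raw JSON listing; parse with your JSON library of choice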
As you can look up on Wikipedia, the 429 status code tells you that you have sent too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are several ways to do this; one would be to use sleep, as in the sketch below.
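A minimal sketch of that (the 2-second interval mirrors reddit's guideline quoted in the other answer; the loop reuses the names from your main method):
// Sketch: pause between downloads to stay under the rate limit.
for (Element link : links) {
    String checkLink = link.attr("href");
    if (imgCheck(checkLink)) {
        downloadImages(checkLink, folderpath);
        try {
            Thread.sleep(2000); // one request every 2 seconds
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}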

Automated login and taking screenshots using Selenium + PhantomJS

I'm writing a Java servlet using Selenium + PhantomJS to log into Alipay (it's like a Chinese version of PayPal). I want to get the authentication code by taking a screenshot of the login page. My code is as below:
package com.alipay.login.test;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import javax.imageio.ImageIO;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.openqa.selenium.By;
import org.openqa.selenium.Dimension;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.Point;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;
import com.Constants;
public class alipayGetAuthCodeServlet extends HttpServlet {

    private static final long serialVersionUID = 1L;

    public byte[] takeScreenshot() throws IOException {
        TakesScreenshot takesScreenshot = (TakesScreenshot) Constants.driver;
        return takesScreenshot.getScreenshotAs(OutputType.BYTES);
    }

    public BufferedImage createElementImage(WebElement webElement) throws IOException {
        Point location = webElement.getLocation();
        Dimension size = webElement.getSize();
        System.out.println(location + " / " + size);
        BufferedImage originalImage = ImageIO.read(new ByteArrayInputStream(takeScreenshot()));
        /*BufferedImage croppedImage = originalImage.getSubimage(
                location.getX(),
                location.getY(),
                size.getWidth(),
                size.getHeight());*/
        return originalImage; // here I return the full screenshot for testing
    }

    protected void service(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        DesiredCapabilities caps = new DesiredCapabilities();
        caps.setJavascriptEnabled(true);
        caps.setCapability("takeScreenshot", true);
        caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, "F:\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
        Constants.driver = new PhantomJSDriver(caps);
        Constants.driver.get("https://auth.alipay.com/login/index.htm");
        try {
            Thread.sleep(5000);
        } catch (InterruptedException e1) {
            e1.printStackTrace();
        }
        try {
            Constants.img = Constants.driver.findElement(By.id("J-checkcode-img")).getAttribute("src");
        } catch (Exception e1) {
        }
        if (!Constants.img.equals("")) {
            BufferedImage captcha = createElementImage(Constants.driver.findElement(By.id("J-checkcode-img")));
            response.setContentType("image/jpeg");
            try {
                ImageIO.write(captcha, "jpeg", response.getOutputStream());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
When I run my servlet, the screenshot I get is like this: http://i.stack.imgur.com/rBxI9.png
However, if I use ChromeDriver instead of PhantomJSDriver, the screenshot is like this: http://i.stack.imgur.com/UgaB4.jpg, which is what the login page should look like.
So the screenshot taken by PhantomJSDriver has the wrong color (I have no idea about this), the wrong size (it seems I can handle this) and, most importantly, no authentication code. I have checked the HTML source returned by both drivers and found that the div of the auth code in PhantomJS has a class of "ui-form-item fn-hide" while the counterpart has a class of "ui-form-item". Is it because the server of Alipay examines what browser I'm using and returns different pages accordingly?
Also, I cannot log in with only username and password using PhantomJS, so I guess I do need an auth code.
Sorry for the long question and thanks in advance for any help!
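One way to test the "different page per browser" theory (a sketch, not a confirmed fix): make PhantomJS report a desktop Chrome user agent via its page settings and compare the HTML it gets back. The phantomjs.page.settings.userAgent capability is PhantomJS-specific; the UA string below is only an example.
// Sketch: spoof a desktop Chrome UA so the server serves the same page variant.
DesiredCapabilities caps = new DesiredCapabilities();
caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY,
        "F:\\phantomjs-2.0.0-windows\\bin\\phantomjs.exe");
caps.setCapability("phantomjs.page.settings.userAgent",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36");
WebDriver driver = new PhantomJSDriver(caps);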
