I'm trying to read the text from a PDF using Selenium WebDriver and the PDFBox API. If possible I don't want to download the file, but only read the PDF from the web, getting just the text of the PDF into a string. I've found examples of code that download the PDF and compare against the downloaded file, but no functional example that extracts the text of the PDF straight from the URL. The code I'm using is below; I can't make it work though:
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import javax.swing.JDialog;
import javax.swing.JOptionPane;
import javax.swing.Timer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
public class PDFextract {
public static void main(String[] args) throws Exception {
// TODO Auto-generated method stub
System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
WebDriver driver=new ChromeDriver();
driver.manage().window().maximize();
driver.get("THE URL OF SITE I CANT SHARE"); //THE URL OF SITE I CAN'T SHARE
System.out.println(driver.getTitle());
List<WebElement> list = driver.findElements(By.xpath("//a[@title='Click to open file']"));
int rows = list.size();
for (int i= 1; i <= rows; i++) {
}
List<WebElement> links = driver.findElements(By.xpath("//a[@title='Click to open file']"));
String fLinks = "";
for (WebElement link : links) {
fLinks = fLinks + link.getAttribute("href");
}
fLinks = fLinks.trim();
System.out.println(fLinks); // up to here the code works fine; I get a valid URL
// the code below doesn't work
URL url=new URL(fLinks);
HttpURLConnection connection=(HttpURLConnection)url.openConnection();
InputStream is=connection.getInputStream();
PDDocument pdd=PDDocument.load(is);
PDFTextStripper stripper=new PDFTextStripper();
String text=stripper.getText(pdd);
pdd.close();
is.close();
System.out.println(text);
}
}
I get the error:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: ***AS TOLD ABOVE, I CANT SHARE THE URL***
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at PDFextract.main(PDFextract.java:106)
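My current guess (untested, since I can't share the URL) is that the direct HttpURLConnection request lacks the session cookies and headers my Selenium-driven browser already has, so the server rejects it. A minimal sketch of forwarding the driver's cookies before handing the stream to PDFBox (assuming the PDF is served by the same site the driver session is on):
// Untested sketch: forward the Selenium session's cookies to the manual request
// so the server treats it like the browser's own request.
URL url = new URL(fLinks);
HttpURLConnection connection = (HttpURLConnection) url.openConnection();

StringBuilder cookieHeader = new StringBuilder();
for (org.openqa.selenium.Cookie cookie : driver.manage().getCookies()) {
    if (cookieHeader.length() > 0) {
        cookieHeader.append("; ");
    }
    cookieHeader.append(cookie.getName()).append("=").append(cookie.getValue());
}
connection.setRequestProperty("Cookie", cookieHeader.toString());
connection.setRequestProperty("User-Agent", "Mozilla/5.0"); // some servers reject the default Java agent

try (InputStream is = connection.getInputStream();
     PDDocument pdd = PDDocument.load(is)) {
    System.out.println(new PDFTextStripper().getText(pdd));
}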
Edited on 07.05.2020:
@TilmanHausherr, I've done more research. This helped with the first part, how to read a PDF from a link: Selenium Tutorial: Read PDF Content using Selenium WebDriver
This method works:
String pdfContent = readPDFContent(driver.getCurrentUrl());
public String readPDFContent(String appUrl) throws Exception {
URL url = new URL(appUrl);
InputStream is = url.openStream();
BufferedInputStream fileToParse = new BufferedInputStream(is);
PDDocument document = null;
String output = null;
try {
document = PDDocument.load(fileToParse);
output = new PDFTextStripper().getText(document);
System.out.println(output);
} finally {
if (document != null) {
document.close();
}
fileToParse.close();
is.close();
}
return output;
}
It seems my problem is the link itself; the HTML element is '<embed>', and in my case there is also a 'stream-URL':
<embed id="plugin" type="application/x-google-chrome-pdf"
src="https://"SITE
I CAN'T TELL"/file.do? _tr=4d51599fead209bc4ef42c6e5c4839c9bebc2fc46addb11a"
stream-URL="chrome-extension://mhjfbmdgcfjojefgiehjai/6958a80-4342-43fc-
838a-1dbd07fa2fc1" headers="accept-ranges: bytes
content-disposition: inline;filename="online.pdf"
content-length: 71488
content-security-policy: frame-ancestors 'self' https://*"SITE I CAN'T TELL"
https://*"DOMAIN I CAN'T TELL".net
content-type: application/pdf
I found these:
1. Download the File which has stream-url is the chrome extension in the embed tag using selenium
2. Handling contents of Embed tag in selenium python
But I still didn't manage to read the PDF with PDFBox, because the element is '<embed>' and I might have to access the stream-URL.
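One direction I'm considering (an untested sketch; the chrome-extension:// stream-URL is internal to Chrome's PDF viewer and can't be fetched directly) is to read the src attribute of the <embed> element with Selenium and feed that URL to readPDFContent:
// Untested sketch: use the original src of the <embed>, not the chrome-extension stream-URL.
WebElement embed = driver.findElement(By.id("plugin"));
String pdfUrl = embed.getAttribute("src");
String pdfContent = readPDFContent(pdfUrl);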
Related
I am able to take a screenshot with ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
In my application I have to take a screenshot of every page, so I want to save the multiple screenshots into a single .doc file, one by one.
Is there any API?
Any idea?
Please help...
Easiest way: take the screenshot, save it to a PNG/JPEG file, read it, add it to the MS Word document, delete the file. Simple. Here's ready-to-use code for you... BINGO!
import java.awt.Rectangle;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.concurrent.TimeUnit;
import javax.imageio.ImageIO;
import org.apache.poi.util.Units;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class TakeScreenshots {
public static void main(String[] args) {
try {
XWPFDocument docx = new XWPFDocument();
XWPFRun run = docx.createParagraph().createRun();
FileOutputStream out = new FileOutputStream("d:/xyz/doc1.docx");
for (int counter = 1; counter <= 5; counter++) {
captureScreenShot(docx, run, out);
TimeUnit.SECONDS.sleep(1);
}
docx.write(out);
out.flush();
out.close();
docx.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void captureScreenShot(XWPFDocument docx, XWPFRun run, FileOutputStream out) throws Exception {
String screenshot_name = System.currentTimeMillis() + ".png";
BufferedImage image = new Robot()
.createScreenCapture(new Rectangle(Toolkit.getDefaultToolkit().getScreenSize()));
File file = new File("d:/xyz/" + screenshot_name);
ImageIO.write(image, "png", file);
InputStream pic = new FileInputStream("d:/xyz/" + screenshot_name);
run.addBreak();
run.addPicture(pic, XWPFDocument.PICTURE_TYPE_PNG, screenshot_name, Units.toEMU(350), Units.toEMU(350));
pic.close();
file.delete();
}
}
Selenium WebDriver does not provide any feature to add a snapshot to a Word file.
For this you need to use third-party libraries.
Refer below:
how to insert image into word document using java
How can I add an Image to MSWord document using Java
You can also add your image file to the TestNG output file using the reporter.
Refer below:
http://www.automationtesting.co.in/2010/07/testng-take-screenshot-of-failed-test.html
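For example (a rough sketch; "screenshot.png" is a placeholder path, and depending on the TestNG version you may need to disable HTML escaping in the report settings for the image tag to render):
import org.testng.Reporter;

// The image path must be reachable relative to the generated HTML report.
Reporter.log("<br><img src=\"screenshot.png\" height=\"200\" width=\"300\"/><br>");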
Hope it will help you :)
So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits such as /r/wallpaper, I get a 429 error and am wondering how to fix this. Totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyways:
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class javascraper{
public static void main (String[]args) throws MalformedURLException
{
Scanner scan = new Scanner (System.in);
System.out.println("Where do you want to store the files?");
String folderpath = scan.next();
System.out.println("What subreddit do you want to scrape?");
String subreddit = scan.next();
subreddit = ("http://reddit.com/r/" + subreddit);
new File(folderpath + "/" + subreddit).mkdir();
//test
try{
//gets http protocol
Document doc = Jsoup.connect(subreddit).timeout(0).get();
//get page title
String title = doc.title();
System.out.println("title : " + title);
//get all links
Elements links = doc.select("a[href]");
for(Element link : links){
//get value from href attribute
String checkLink = link.attr("href");
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
if (imgCheck(checkLink)){ // checks to see if img link j
System.out.println("link : " + link.attr("href"));
downloadImages(checkLink, folderpath);
}
}
}
catch (IOException e){
e.printStackTrace();
}
}
public static boolean imgCheck(String http){
String png = ".png";
String jpg = ".jpg";
String jpeg = "jpeg"; // no period so checker will only check last four characaters
String gif = ".gif";
int length = http.length();
if (http.contains(png)|| http.contains("gfycat") || http.contains(jpg)|| http.contains(jpeg) || http.contains(gif)){
return true;
}
else{
return false;
}
}
private static void downloadImages(String src, String folderpath) throws IOException{
String folder = null;
//Extract the name of the image from the src attribute
int indexname = src.lastIndexOf("/");
if (indexname == src.length()) {
src = src.substring(1, indexname);
}
indexname = src.lastIndexOf("/");
String name = src.substring(indexname, src.length());
System.out.println(name);
//Open a URL Stream
URL url = new URL(src);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream( folderpath+ name));
for (int b; (b = in.read()) != -1;) {
out.write(b);
}
out.close();
in.close();
}
}
Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests" – you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (the format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). The way it currently looks, your code is running too fast and doesn't specify one, so it's going to be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
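For example (hypothetical values, just to show the shape of the string):
public static final String USER_AGENT = "desktop:com.example.imagescraper:v0.1 (by /u/your_reddit_username)";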
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
// Note: Thread.sleep throws InterruptedException, so downloadImages must also declare it
// (e.g. "throws IOException, InterruptedException") or wrap the sleep in a try/catch.
URLConnection connection = (new URL(src)).openConnection();
Thread.sleep(2000); // delay to comply with rate limiting
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (i.e. /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.
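If you do want to try the JSON listing route, a rough sketch looks like this (the subredditName variable is a placeholder, and you would still need a JSON library to parse the response):
// Rough sketch: fetch the JSON listing for a subreddit instead of scraping the HTML page.
String json = Jsoup.connect("https://www.reddit.com/r/" + subredditName + ".json")
        .ignoreContentType(true)   // jsoup refuses non-HTML responses unless told otherwise
        .userAgent(USER_AGENT)
        .execute()
        .body();
// Parse "json" with any JSON library; the image URLs are under data.children[*].data.url.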
As you can look up on Wikipedia, the 429 status code tells you that you have sent too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are several ways to do this; one would be to use sleep.
I was trying to get an HTML page and parse information from it. I just found out that some of the pages were not completely downloaded using Jsoup. I checked with the curl command on the command line and the complete page was downloaded. Initially I thought it was site specific, but then I randomly tried to parse other big webpages with Jsoup and found that it didn't download the complete webpage either. I tried specifying the user agent and timeout properties, but it still failed to download the full page. Here is the code I tried:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String[] args) throws MalformedURLException, UnsupportedEncodingException, IOException {
String urlStr = "http://en.wikipedia.org/wiki/List_of_law_clerks_of_the_Supreme_Court_of_the_United_States";
URL url = new URL(urlStr);
String content = "";
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
for (String line; (line = reader.readLine()) != null;) {
content += line;
}
}
String article1 = Jsoup.connect(urlStr).get().text();
String article2 = Jsoup.connect(urlStr).userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6").referrer("http://www.google.com").timeout(30000).execute().parse().text();
String article3 = Jsoup.parse(content).text();
System.out.println("ARTICLE 1 : "+article1);
System.out.println("ARTICLE 2 : "+article2);
System.out.println("ARTICLE 3 : "+article3);
}
}
In Articles 1 and 2, where I use Jsoup to connect to the website, I do not get the complete content, but when using URL to connect I get the complete page. So basically Article 3, which was done using URL, is complete. I have tried with Jsoup 1.8.1 and Jsoup 1.7.2.
Use method maxBodySize:
String article = Jsoup.connect(urlStr).maxBodySize(Integer.MAX_VALUE).get().text();
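The reason the page came back incomplete is that jsoup truncates responses above its default body-size limit (historically 1 MB in older releases such as the ones you mention). Passing 0 removes the limit entirely:
String article = Jsoup.connect(urlStr).maxBodySize(0).get().text(); // 0 = no body size limit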
I have written code which fetches the HTML contents of a page as the response; I am using HtmlUnit to do so. But I am getting errors for some specific URLs like
https://communities.netapp.com/welcome
For the first page I am able to retrieve the contents, but I can't get the content that appears after clicking the "load more" button.
Here's my code:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class Sample {
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException {
String url = "https://communities.netapp.com/welcome";
WebClient client = new WebClient(BrowserVersion.INTERNET_EXPLORER_9);
client.getOptions().setJavaScriptEnabled(true);
client.getOptions().setRedirectEnabled(true);
client.getOptions().setThrowExceptionOnScriptError(true);
client.getOptions().setCssEnabled(true);
client.getOptions().setUseInsecureSSL(true);
client.getOptions().setThrowExceptionOnFailingStatusCode(false);
client.setAjaxController(new NicelyResynchronizingAjaxController());
HtmlPage page = client.getPage(url);
Writer output = null;
String text = page.asText();
File file = new File("D://write6.txt");
output = new BufferedWriter(new FileWriter(file));
output.write(text);
output.close();
System.out.println("Your file has been written");
// System.out.println("as Text ==" +page.asText());
// System.out.println("asXML == " +page.asXml());
// System.out.println("text content ==" +page.getTextContent());
// System.out.println(page.getWebResponse().getContentAsString());
}
}
Any suggestions?
As I understand from your question, you have a button that you have to press.
Please look at: http://htmlunit.sourceforge.net/gettingStarted.html
There you have an example of submitting a form; this should be very similar here.
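A rough sketch of what that could look like (the XPath for the "load more" button is a guess; adjust it to the real page, and HtmlElement comes from com.gargoylesoftware.htmlunit.html):
// Rough sketch: click the "load more" button and wait for the AJAX content.
HtmlPage page = client.getPage(url);
client.waitForBackgroundJavaScript(10000);   // let the initial scripts finish
HtmlElement loadMore = page.getFirstByXPath("//a[contains(text(), 'Load more')]");
if (loadMore != null) {
    page = loadMore.click();                 // click() returns the updated page
    client.waitForBackgroundJavaScript(10000);
}
String text = page.asText();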
I have an XML file which has a node called "CONTENIDO"; in this node I have a PDF file encoded as a base64 string.
I'm trying to read this node, decode the base64 string and download the PDF file to my computer.
The problem is that the file is downloaded with the same size (in kB) as the original PDF and has the same number of pages, but... all the pages are blank without any content, and when I open the downloaded file a popup appears with an error saying "unknown distinctive 806.6n". I don't know what that means.
I've tried to find a solution on the internet, with different ways to decode the string, but I always get the same result... The XML is OK; I've checked the base64 string and it is OK.
I've also debugged the code and seen that the content of the variable "fichero", where I'm reading the base64 string, is also OK, so I don't know what the problem can be.
This is my code:
package prueba.sap.com;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import sun.misc.BASE64Decoder;
import javax.xml.bind.DatatypeConverter;
public class anexoPO {
public static void main(String[] args) throws Exception {
FileInputStream inFile =
new FileInputStream("C:/prueba/prueba_attach_b64.xml");
FileOutputStream outFile =
new FileOutputStream("C:/prueba/salida.pdf");
anexoPO myMapping = new anexoPO();
myMapping.execute(inFile, outFile);
System.out.println("Success");
System.out.println(inFile);
}
public void execute(InputStream in, OutputStream out)
throws com.sap.aii.mapping.api.StreamTransformationException {
try {
//************************Code To Generate The XML Parsing Objects*****************************//
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(in);
Document docout = db.newDocument();
NodeList CONTENIDO = doc.getElementsByTagName("CONTENIDO");
String fichero = CONTENIDO.item(0).getChildNodes().item(0).getNodeValue();
//************** decode *************/
//import sun.misc.BASE64Decoder;
//BASE64Decoder decoder = new BASE64Decoder();
//byte[] decoded = decoder.decodeBuffer(fichero);
//import org.apache.commons.codec.binary.*;
//byte[] decoded = Base64.decode(fichero);
//import javax.xml.bind.DatatypeConverter;
byte[] decoded = DatatypeConverter.parseBase64Binary(fichero);
//************** decode *************/
String str = new String(decoded);
out.write(str.getBytes());
} catch (Exception e) {
System.out.print("Problem parsing the file");
e.printStackTrace();
}
}
}
Thanks in advance.
Definitely:
out.write(decoded);
out.close();
Strings cannot represent all bytes, and PDF is binary.
Also remove the import of sun.misc.BASE64Decoder, as this package does not exist everywhere. It might be removed by the compiler, however I would not bet on it.
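If you are on Java 8 or later, a small alternative sketch is to use java.util.Base64 instead of sun.misc / javax.xml.bind and write the raw bytes directly:
// java.util.Base64 is part of the JDK since Java 8; the MIME decoder tolerates the line
// breaks that are often present in base64 content stored inside XML.
byte[] decoded = java.util.Base64.getMimeDecoder().decode(fichero);
out.write(decoded);
out.close();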