I would like to build a crawler in Java that gives me all cookies from a website. The crawler is supposed to crawl a list of websites (and obviously their subpages) automatically.
I have used jsoup and Selenium for this.
package com.mycompany.app;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.Cookie;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;
public class BasicWebCrawler {
private static Set<String> uniqueURL = new HashSet<String>();
private static List<String> link_list = new ArrayList<String>();
private static Set<String> uniqueCookies = new HashSet<String>();
private static void get_links(String url) {
Connection connection = null;
Connection.Response response = null;
String this_link = null;
try {
connection = Jsoup.connect(url);
response = connection.execute();
//cookies_http = response.cookies();
// fetch the document over HTTP
Document doc = response.parse();
// get all links in page
Elements links = doc.select("a[href]");
if(links.isEmpty()) {
return;
}
for (Element link : links) {
this_link = link.attr("href");
boolean add = uniqueURL.add(this_link);
System.out.println("\n" + this_link + "\n" + "title: " + doc.title());
if (add && (this_link.contains(url))) {
System.out.println("\n" + this_link + "\n" + "title: " + doc.title());
link_list.add(this_link);
get_links(this_link);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
get_links("https://de.wikipedia.org/wiki/Wikipedia");
/**
* This is where Selenium comes into play
*/
WebDriver driver;
System.setProperty("webdriver.chrome.driver", "D:\\crawler\\driver\\chromedriver.exe");
driver = new ChromeDriver();
// create a file named Cookies.data to store the cookie information
File file = new File("Cookies.data");
FileWriter fileWrite = null;
BufferedWriter Bwrite = null;
try {
// Delete old file if exists
file.delete();
file.createNewFile();
fileWrite = new FileWriter(file);
Bwrite = new BufferedWriter(fileWrite);
// loop for getting the cookie information
} catch (Exception ex) {
ex.printStackTrace();
}
for(String link : link_list) {
System.out.println("Open Link: " + link);
driver.get(link);
try {
// loop for getting the cookie information
for (Cookie ck : driver.manage().getCookies()) {
String tmp = (ck.getName() + ";" + ck.getValue() + ";" + ck.getDomain() + ";" + ck.getPath() + ";" + ck.getExpiry() + ";" + ck.isSecure());
if(uniqueCookies.add(tmp)) {
Bwrite.write("Link: " + link + "\n" + (ck.getName() + ";" + ck.getValue() + ";" + ck.getDomain() + ";" + ck.getPath() + ";" + ck.getExpiry() + ";" + ck.isSecure())+ "\n\n");
Bwrite.newLine();
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
try {
Bwrite.close();
fileWrite.close();
driver.close();
driver.quit();
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
I tested this code on a Wikipedia page and compared the result with a cookie scanner called CookieMetrix.
My code shows only four cookies:
Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
GeoIP;DE:NW:M__nster:51.95:7.54:v4;.wikipedia.org;/;null;true
Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
WMF-Last-Access-Global;13-May-2019;.wikipedia.org;/;Mon Jan 19 02:28:33 CET 1970;true
Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
WMF-Last-Access;13-May-2019;de.wikipedia.org;/;Mon Jan 19 02:28:33 CET 1970;true
Link: https://de.wikipedia.org/wiki/Wikipedia:Lizenzbestimmungen_Commons_Attribution-ShareAlike_3.0_Unported
mwPhp7Seed;55e;de.wikipedia.org;/;Mon Jan 19 03:09:08 CET 1970;false
But the cookie scanner shows seven. I don't know why my code shows fewer cookies than CookieMetrix. Can you help me?
JavaDoc for java.util.Set<Cookie> getCookies():
Get all the cookies for the current domain. This is the equivalent of calling "document.cookie" and parsing the result
1. document.cookie will not return HttpOnly cookies, simply because JavaScript does not allow it.
2. Also notice that CookieMetrix seems to list cookies from different domains, while getCookies() only covers the current domain.
Solutions:
To get a listing such as CookieMetrix's (covering both 1 and 2), you could add a proxy behind your browser and sniff the requests.
In case you want to get all cookies for the current domain, including HttpOnly ones (1), you could try accessing Chrome's DevTools API directly (as far as I remember, it also returns HttpOnly cookies).
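For the second option, here is a minimal sketch of what that could look like. It assumes Selenium 4's ChromeDriver, whose executeCdpCommand method sends raw DevTools Protocol commands; Network.getAllCookies is the DevTools command that also returns HttpOnly cookies:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.openqa.selenium.chrome.ChromeDriver;
public class DevToolsCookies {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "D:\\crawler\\driver\\chromedriver.exe");
        ChromeDriver driver = new ChromeDriver();
        try {
            driver.get("https://de.wikipedia.org/wiki/Wikipedia");
            // "Network.getAllCookies" is a DevTools Protocol command; unlike document.cookie
            // (and therefore driver.manage().getCookies()), it also returns HttpOnly cookies.
            Map<String, Object> result = driver.executeCdpCommand("Network.getAllCookies", new HashMap<>());
            @SuppressWarnings("unchecked")
            List<Map<String, Object>> cookies = (List<Map<String, Object>>) result.get("cookies");
            for (Map<String, Object> cookie : cookies) {
                System.out.println(cookie.get("name") + ";" + cookie.get("value") + ";"
                        + cookie.get("domain") + ";httpOnly=" + cookie.get("httpOnly"));
            }
        } finally {
            driver.quit();
        }
    }
}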
I've created a web application which creates folders when a destination input is filled in (for example, a path like C:\xxx\xxx).
When I run it locally (http://localhost:8080), it works perfectly: it finds the local Windows path and creates the folders.
But now I want to open this web app to a group of people, so I deployed it on Tomcat on an internal Unix server (http://ipnumber:portnumber).
The problem is that when a user fills the input with a local destination, the code cannot find the path or access the local computer's folder structure; it looks at the Unix server's folder structure instead.
How can I achieve this? I use AngularJS for the frontend, which calls the REST API with $http.post; the backend side is Java.
package com.ama.ist.controller;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.CrossOrigin;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;
import com.ama.ist.model.CustomErrorType;
import com.ama.ist.model.Patch;
import com.ama.ist.service.PatchService;
@RestController
public class PatchController {
@Autowired
private PatchService patchService;
@CrossOrigin(origins = "http://ipnumber:portnumber")
@RequestMapping(value = "/mk", method = RequestMethod.POST)
public ResponseEntity<?> createFolder(@RequestBody Patch patch) {
System.out.println("patch ddest: => " + patch.getDestination());
String iscreatedstatus = patchService.create(patch);
System.out.println("iscreatedstatus" + iscreatedstatus);
if (!(iscreatedstatus.equals("Success"))) {
System.out.println("if success" );
return new ResponseEntity<Object>(new CustomErrorType("ER",iscreatedstatus), HttpStatus.NOT_FOUND);
}
System.out.println("if disinda success" );
return new ResponseEntity<Object>(new CustomErrorType("OK",iscreatedstatus), HttpStatus.CREATED);
}
//
#RequestMapping("/resource")
public Map<String,Object> home() {
Map<String,Object> model = new HashMap<String,Object>();
model.put("id", UUID.randomUUID().toString());
model.put("content", "Hello World");
return model;
}
}
This is the service:
package com.ama.ist.service;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.springframework.stereotype.Service;
import org.tmatesoft.svn.core.SVNDepth;
import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.SVNProperties;
import org.tmatesoft.svn.core.SVNURL;
import org.tmatesoft.svn.core.auth.BasicAuthenticationManager;
import org.tmatesoft.svn.core.auth.ISVNAuthenticationManager;
import org.tmatesoft.svn.core.wc.SVNCommitClient;
import org.tmatesoft.svn.core.wc.SVNWCUtil;
import com.ama.ist.model.Patch;
import com.ama.ist.model.User;
@Service
public class PatchService {
public String create(Patch patch) {
String ConstantPath = patch.getDestination();
File testFile = new File("");
String currentPath = testFile.getAbsolutePath();
System.out.println("current path is: " + currentPath);
System.out.println("ConstantPath => " + ConstantPath);
// if (!(isValidPath(ConstantPath))) {
// return "invalid Path";
// }
// System.out.println("Valid mi " + isValidPath(ConstantPath));
String foldername = patch.getWinNum() + " - " + patch.getWinName();
System.out.println(ConstantPath + foldername);
File files = new File(ConstantPath + foldername);
if (files.exists()) {
return "The Folder is already created in that path";
}
File files1 = new File(ConstantPath + foldername + "\\Patch");
File files2 = new File(ConstantPath + foldername + "\\Backup");
File files3 = new File(ConstantPath + foldername + "\\Backup\\UAT");
File files4 = new File(ConstantPath + foldername + "\\Backup\\PROD");
if (!files.exists()) {
if (files.mkdirs()) {
files1.mkdir();
files2.mkdir();
files3.mkdir();
files4.mkdir();
createReadme(ConstantPath + foldername, patch);
if (patch.isChecked()) {
System.out.println("patch.getDestination => " + patch.getDestination());
System.out.println("patch.getDetail => " + patch.getDetail());
System.out.println("patch.getSvnPath => " + patch.getSvnPath());
System.out.println("patch.getWinName => " + patch.getWinName());
System.out.println("patch.getWinNum => " + patch.getWinNum());
System.out.println("patch.getUserName => " + patch.getUser().getUserName());
System.out.println("patch.getPassword => " + patch.getUser().getPassword());
ImportSvn(patch);
}
System.out.println("Multiple directories are created!");
return "Success";
} else {
System.out.println("Failed to create multiple directories!");
return "Unknwon error";
}
} else {
return "File name is already exists";
}
}
public static boolean isValidPath(String path) {
System.out.println("path => " + path);
File f = new File(path);
if (f.isDirectory()) {
System.out.println("true => ");
return true;
} else {
System.out.println("false => ");
return false;
}
}
public void createReadme(String path, Patch patch) {
try {
ClassLoader classLoader = getClass().getClassLoader();
File file = new File(classLoader.getResource("Readme.txt").getFile());
// System.out.println("!!!!!!!!!!" + new java.io.File("").getAbsolutePath());
// File file = new File("resources/Readme.txt");
System.out.println(file.getAbsolutePath());
FileReader reader = new FileReader(file);
BufferedReader bufferedReader = new BufferedReader(reader);
String line;
PrintWriter writer = new PrintWriter(path + "\\Readme.txt", "UTF-8");
System.out.println(path + "\\Readme.txt");
while ((line = bufferedReader.readLine()) != null) {
line = line.replace("#Winnumber", Integer.toString(patch.getWinNum()));
line = line.replace("#NameSurname", " ");
line = line.replace("#Type", "Package");
line = line.replace("#detail", patch.getDetail());
SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy");
String date = sdf.format(new Date());
line = line.replace("#Date", date);
line = line.replace("#Desc", patch.getWinName());
writer.println(line);
System.out.println(line);
}
reader.close();
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
}
public void ImportSvn(Patch patch) {
String name = patch.getUser().getUserName();
String password = patch.getUser().getPassword();
// String filename = patch.getWinName()
String filename = patch.getWinNum() + " - " + patch.getWinName();
String url = patch.getSvnPath() + "/" + filename;
ISVNAuthenticationManager authManager = new BasicAuthenticationManager(name, password);
SVNCommitClient commitClient = new SVNCommitClient(authManager, SVNWCUtil.createDefaultOptions(true));
File f = new File(patch.getDestination() + filename);
try {
String logMessage = filename;
commitClient.doImport(f, // File/Directory to be imported
SVNURL.parseURIEncoded(url), // location within svn
logMessage, // svn comment
new SVNProperties(), // svn properties
true, // use global ignores
false, // ignore unknown node types
SVNDepth.INFINITY);
// SVNClientManager cm =
// SVNClientManager.newInstance(SVNWCUtil.createDefaultOptions(true),authManager);
//
// SVNUpdateClient uc = cm.getUpdateClient();
// long[] l = uc.doUpdate(new File[]{dstPath},
// SVNRevision.HEAD,SVNDepth.INFINITY, true,true);
} catch (SVNException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
This is the AngularJS side:
$scope.Create = function() {
$scope.obj = [];
console.log("$scope.svnPath" + $scope.patch.svnPath);
console.log("$scope.userName" + $scope.patch.user.userName);
$http({
method : "POST",
url : "http://ipnumber:port/patchinit/mk",
data : $scope.patch
}).then(function mySuccess(response) {
console.log("Success!! ");
$scope.obj = response.data;
$scope.errorMessage = response.data.errorMessage;
$scope.errorCode = response.data.errorCode;
}, function myError(response) {
//$scope.obj = response.statusText;
$scope.errorMessage = response.data.errorMessage;
$scope.errorCode = response.data.errorCode;
});
}
You can share that folder on Windows and mount the shared folder on the Unix server. Once mounted, it can be accessed easily via Samba (smb://192.168.1.117/Your_Folder).
Samba is standard on nearly all distributions of Linux and is commonly included as a basic system service on other Unix-based operating systems as well.
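As an illustration only (the mount point /mnt/patch_share, the folder names, and the mount command in the comment are assumptions, not details from the question), the backend could then build its folder structure against the mounted share instead of a Windows drive letter:
import java.io.File;
public class MountedShareExample {
    // Assumed location where the Windows share was mounted on the Unix server,
    // e.g. with something like: mount -t cifs //192.168.1.117/Your_Folder /mnt/patch_share
    private static final String MOUNT_POINT = "/mnt/patch_share";
    public static void main(String[] args) {
        // Build the patch folder under the mounted share instead of C:\xxx\xxx.
        File target = new File(MOUNT_POINT, "12345 - ExamplePatch"); // hypothetical WinNum/WinName
        if (target.exists()) {
            System.out.println("The folder is already created in that path");
            return;
        }
        if (target.mkdirs()) {
            // Same subfolder layout as PatchService.create(...)
            new File(target, "Patch").mkdir();
            new File(target, "Backup/UAT").mkdirs();
            new File(target, "Backup/PROD").mkdirs();
            System.out.println("Created " + target.getAbsolutePath());
        } else {
            System.out.println("Failed to create directories under " + MOUNT_POINT);
        }
    }
}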
I have managed to write code that handles a single file.
Now I want to use the same code to handle all the XML files located in a directory.
Can someone tell me how to declare the path and how to write the loop?
Thanks in advance.
import org.xml.sax.SAXException;
import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.IOException;
public class XmlReadWrite3 {
public static void main(String[] args) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse("C:/Users/Desktop/1381.xml");
Element langs = doc.getDocumentElement();
Element filename= getElement(langs, "Filename");
Element beschreibung = getElement(langs, "Beschreibung");
Element name = getElement(langs, "Name");
Element ide = getElement(langs, "IDe");
System.out.println("Filename: " + filename.getTextContent() + "\n" + "Beschreibung: "
+ beschreibung.getTextContent() + "\n" + "Ersteller: " + name.getTextContent() + "\n"
+ "Pnummer: " + ide.getTextContent() + "\n\n");
}catch (ParserConfigurationException pce) {
pce.printStackTrace();
} catch (SAXException se) {
se.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
private static Element getElement(Element langs, String tag){
return (Element) langs.getElementsByTagName(tag).item(0);
}
}
Hi, you can use the Path and Files classes to loop through a directory:
import org.xml.sax.SAXException;
import org.w3c.dom.*;
import javax.xml.parsers.*;
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
class XmlReadWrite3 {
public static void main(String[] args) {
// here you enter the path to your directory.
// for example: Path workDir = Paths.get("c:\\workspace\\xml-files")
Path workDir = Paths.get("path/to/dir"); // enter the path to your xml-dir
// the if checks whether the directory truly exists
if (!Files.notExists(workDir)) {
// this part iterates over all files within the directory
try (DirectoryStream<Path> directoryStream = Files.newDirectoryStream(workDir)) {
for (Path path : directoryStream) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(path.toString());
Element langs = doc.getDocumentElement();
Element filename = getElement(langs, "Filename");
Element beschreibung = getElement(langs, "Beschreibung");
Element name = getElement(langs, "Name");
Element ide = getElement(langs, "IDe");
System.out.println("Filename: " + filename.getTextContent() + "\n" + "Beschreibung: "
+ beschreibung.getTextContent() + "\n" + "Ersteller: " + name.getTextContent() + "\n"
+ "Pnummer: " + ide.getTextContent() + "\n\n");
} catch (ParserConfigurationException pce) {
pce.printStackTrace();
} catch (SAXException se) {
se.printStackTrace();
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
private static Element getElement(Element langs, String tag) {
return (Element) langs.getElementsByTagName(tag).item(0);
}
}
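If the directory also contains non-XML files, the same loop can be limited to .xml files with a glob filter on newDirectoryStream; a small sketch, reusing the workDir variable and imports from the code above:
// Matches only files whose names end in .xml; other files in the directory are skipped.
try (DirectoryStream<Path> xmlFiles = Files.newDirectoryStream(workDir, "*.xml")) {
    for (Path path : xmlFiles) {
        System.out.println("Parsing " + path);
        // ... parse path.toString() with DocumentBuilder exactly as shown above ...
    }
} catch (Exception e) {
    System.out.println(e.getMessage());
}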
I have successfully read the content of a Gmail email using JavaMail and stored it in a String. Now I want to get a specific registration URL from that content. How can I do this? The String contains plenty of tags and hrefs, but I want to extract only the URL behind the "click here" hyperlink in the statement below:
"Please <a class="h5" href="https://newstaging.mobilous.com/en/user-register/******" target="_blank">click here</a> to complete your registration".
That is, on the "click here" hyperlink, the URL is
href="https://newstaging.mobilous.com/en/user-register/******" target="_blank"
I have tried this using the following code:
package email;
import java.util.ArrayList;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.NoSuchProviderException;
import javax.mail.Session;
import javax.mail.Store;
public class emailAccess {
public static void check(String host, String storeType, String user,
String password)
{
try {
//create properties field
Properties properties = new Properties();
properties.put("mail.imap.host",host);
properties.put("mail.imap.port", "993");
properties.put("mail.imap.starttls.enable", "true");
properties.setProperty("mail.imap.socketFactory.class","javax.net.ssl.SSLSocketFactory");
properties.setProperty("mail.imap.socketFactory.fallback", "false");
properties.setProperty("mail.imap.socketFactory.port",String.valueOf(993));
Session emailSession = Session.getDefaultInstance(properties);
//create the POP3 store object and connect with the pop server
Store store = emailSession.getStore("imap");
store.connect(host, user, password);
//create the folder object and open it
Folder emailFolder = store.getFolder("INBOX");
emailFolder.open(Folder.READ_ONLY);
// retrieve the messages from the folder in an array and print it
Message[] messages = emailFolder.getMessages();
System.out.println("messages.length---" + messages.length);
int n=messages.length;
for (int i = 0; i<n; i++) {
Message message = messages[i];
ArrayList<String> links = new ArrayList<String>();
if(message.getSubject().contains("Thank you for signing up for AppExe")){
String desc=message.getContent().toString();
// System.out.println(desc);
Pattern linkPattern = Pattern.compile(" <a\\b[^>]*href=\"[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(desc);
while(pageMatcher.find()){
links.add(pageMatcher.group());
}
}else{
System.out.println("Email:"+ i + " is not a wanted email");
}
for(String temp:links){
if(temp.contains("user-register")){
System.out.println(temp);
}
}
/*System.out.println("---------------------------------");
System.out.println("Email Number " + (i + 1));
System.out.println("Subject: " + message.getSubject());
System.out.println("From: " + message.getFrom()[0]);
System.out.println("Text: " + message.getContent().toString());*/
}
//close the store and folder objects
emailFolder.close(false);
store.close();
} catch (NoSuchProviderException e) {
e.printStackTrace();
} catch (MessagingException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void main(String[] args) {
// TODO Auto-generated method stub
String host = "imap.gmail.com";
String mailStoreType = "imap";
String username = "rameshakur@gmail.com";
String password = "*****";
check(host, mailStoreType, username, password);
}
}
On executing I got the output as:
< class="h5" href="https://newstaging.mobilous.com/en/user-register/******" target="_blank">
How can I extract only the href value, i.e. https://newstaging.mobilous.com/en/user-register/******?
Please suggest, thanks.
You're close. You're using group(), but you've got a couple issues. Here's some code that should work, replacing just a bit of what you've got:
Pattern linkPattern = Pattern.compile(" <a\\b[^>]*href=\"([^\"]*)[^>]*>(.*?)</a>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(desc);
while(pageMatcher.find()){
links.add(pageMatcher.group(1));
}
All I did was change your pattern so it explicitly looks for the closing quote of the href attribute, and wrap the portion of the pattern that matches the string you're looking for in parentheses, making it a capturing group.
I also passed an argument to the pageMatcher.group() method, so it returns just that captured group instead of the whole match.
Tell you the truth, you could probably just use this pattern instead (along with the .group(1) change):
Pattern linkPattern = Pattern.compile("href=\"([^\"]*)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
I am using iText to convert HTML to PDF, and for that I am using this code:
import java.io.FileOutputStream;
import java.io.StringReader;
import javax.sql.rowset.spi.XmlWriter;
import com.itextpdf.text.Chunk;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;
public class HtmlToPDF2 {
// itextpdf-5.4.1.jar http://sourceforge.net/projects/itext/files/iText/
public static void main(String ... args ) {
try {
Document document = new Document(PageSize.LETTER);
PdfWriter.getInstance(document, new FileOutputStream("testpdf1.pdf"));
document.open();
HTMLWorker htmlWorker = new HTMLWorker(document);
String firstName = "<name>" ;
String sign = "<sign>";
String str = "<html> " +
"<body>" +
"<form>" +
"<div><strong>Dear</strong> "+firstName +",</div><br/>"+
"<div>"+
"<P> It is informed that you are selected in your interview<br/>"+
" and please report on the <b>20 may</b> with your all original <br/>"+
" document on our head office at jaipur.>"+
" </P>"+
" </div><br/>"+
" <div>"+
" <p>Yours sincierly </p><br/>"+sign+"</div>"+
" </form>"+
"<body>"+
"<html>";
htmlWorker.parse(new StringReader(str));
document.close();
System.out.println("Done");
}
catch (Exception e) {
e.printStackTrace();
}
}
}
But this gives me output in which the placeholders do not appear (screenshots of the actual and desired output were attached to the original post).
Is this the correct way to create a placeholder, or do I need to do anything else? If yes, please suggest how.
The < and > signs are treated as HTML tags. Because of that, they don't show up in your PDF.
You can define firstName and sign as below, escaping the angle brackets as HTML entities so they are rendered literally:
public class HtmlToPDF2 {
public static void main(String ... args ) {
....
....
String firstName = "&lt;name&gt;";
String sign = "&lt;sign&gt;";
....
....
}
}
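As an alternative sketch (not from the original answer): if <name> and <sign> are really meant as placeholders to be filled with actual values before the PDF is generated, plain tokens such as {name} avoid the angle-bracket problem entirely. The token names, output file name, and substituted values below are made up for illustration:
import java.io.FileOutputStream;
import java.io.StringReader;
import com.itextpdf.text.Document;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.html.simpleparser.HTMLWorker;
import com.itextpdf.text.pdf.PdfWriter;
public class HtmlToPdfPlaceholders {
    public static void main(String[] args) throws Exception {
        Document document = new Document(PageSize.LETTER);
        PdfWriter.getInstance(document, new FileOutputStream("testpdf2.pdf"));
        document.open();
        HTMLWorker htmlWorker = new HTMLWorker(document);
        // {name} and {sign} are plain text, so HTMLWorker does not treat them as tags.
        String template = "<div><strong>Dear</strong> {name},</div>"
                + "<div><p>It is informed that you are selected in your interview.</p></div>"
                + "<div><p>Yours sincerely</p><div>{sign}</div></div>";
        // Substitute the real values before handing the HTML to HTMLWorker.
        String html = template.replace("{name}", "John Doe").replace("{sign}", "HR Department");
        htmlWorker.parse(new StringReader(html));
        document.close();
    }
}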
This may seem an old question, but I didn't find an exhaustive answer after spending half an hour searching all over SO.
I am using PDFBox and I would like to extract all of the text from a PDF file along with the coordinates of each string. I am using their PrintTextLocations example (http://pdfbox.apache.org/apidocs/org/apache/pdfbox/examples/util/PrintTextLocations.html) but with the kind of pdf I am using (E-Tickets) the program fails to recognize strings, printing each character separately. The output is a list of strings (each representing a TextPosition object) like this:
String[414.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.0] s
String[418.93896,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] a
String[423.38696,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=1.776001] l
String[425.16296,637.2442 fs=1.0 xscale=8.0 height=4.94 space=2.2240002 width=4.447998] e
I would like the program to recognize the string "sale" as a single TextPosition and give me its position.
I also tried to play with the setSpacingTolerance() and setAverageCharacterTolerance() PDFTextStripper methods, setting different values above and below the default values (which, FYI, are 0.5 and 0.3 respectively), but the output didn't change at all. Where am I going wrong? Thanks in advance.
As Joey mentioned, a PDF is just a collection of instructions telling you where a certain character should be printed.
In order to extract words or lines, you will have to perform some data segmentation: studying the bounding boxes of the characters should let you recognize those that are on the same line, and then which ones form words.
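A minimal sketch of that segmentation idea, assuming PDFBox 2.x: characters handed to writeString are grouped into a word until the horizontal gap to the next character exceeds half a space width (an arbitrary threshold you would tune), and each word is printed with the coordinates of its first character. The input file name is made up:
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
class WordGroupingStripper extends PDFTextStripper {
    public WordGroupingStripper() throws IOException {
        super();
    }
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        StringBuilder word = new StringBuilder();
        TextPosition wordStart = null;
        TextPosition previous = null;
        for (TextPosition current : textPositions) {
            if (previous != null) {
                float gap = current.getXDirAdj() - (previous.getXDirAdj() + previous.getWidthDirAdj());
                // A gap wider than half a space width ends the current word.
                if (gap > previous.getWidthOfSpace() * 0.5f) {
                    System.out.println(word + " @ x=" + wordStart.getXDirAdj() + " y=" + wordStart.getYDirAdj());
                    word.setLength(0);
                    wordStart = null;
                }
            }
            if (wordStart == null) {
                wordStart = current;
            }
            word.append(current.getUnicode());
            previous = current;
        }
        if (word.length() > 0) {
            System.out.println(word + " @ x=" + wordStart.getXDirAdj() + " y=" + wordStart.getYDirAdj());
        }
    }
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("eticket.pdf"))) { // hypothetical input
            WordGroupingStripper stripper = new WordGroupingStripper();
            stripper.setSortByPosition(true);
            stripper.getText(document); // triggers writeString(...) for every text chunk
        }
    }
}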
Here is a solution:
1. Read the file.
2. Fetch each page's text using PDFParserTextStripper.
3. The position of each character of the text is printed.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
class PDFParserTextStripper extends PDFTextStripper {
public PDFParserTextStripper(PDDocument pdd) throws IOException {
super();
document = pdd;
}
public void stripPage(int pageNr) throws IOException {
this.setStartPage(pageNr + 1);
this.setEndPage(pageNr + 1);
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
}
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
for (TextPosition text : textPositions) {
System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj() + " fs=" + text.getFontSizeInPt()
+ " xscale=" + text.getXScale() + " height=" + text.getHeightDir() + " space="
+ text.getWidthOfSpace() + " width=" + text.getWidthDirAdj() + " ] " + text.getUnicode());
}
}
public static void extractText(InputStream inputStream) {
PDDocument pdd = null;
try {
pdd = PDDocument.load(inputStream);
PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
stripper.setSortByPosition(true);
for (int i = 0; i < pdd.getNumberOfPages(); i++) {
stripper.stripPage(i);
}
} catch (IOException e) {
// throw error
} finally {
if (pdd != null) {
try {
pdd.close();
} catch (IOException e) {
}
}
}
}
public static void main(String[] args) throws IOException {
File f = new File("C://PDFLOCATION//target.pdf");
FileInputStream fis = null;
try {
fis = new FileInputStream(f);
extractText(fis);
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
if (fis != null)
fis.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}