How to split pdf file by result in java pdfbox - java

I have one PDF file, which contains 60 pages. Each page has an Invoice No.; some are unique and some are repeated across pages. I'm using Apache PDFBox.
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
import java.util.regex.*;
public class PDFtest1 {
public static void main(String[] args){
PDDocument pd;
try {
File input = new File("G:\\Sales.pdf");
// StringBuilder to store the extracted text
StringBuilder sb = new StringBuilder();
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
// Add text to the StringBuilder from the PDF
sb.append(stripper.getText(pd));
Pattern p = Pattern.compile("Invoice No.\\s\\w\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d");
// Matcher refers to the actual text where the pattern will be found
Matcher m = p.matcher(sb);
while (m.find()){
// group() method refers to the next number that follows the pattern we have specified.
System.out.println(m.group());
}
if (pd != null) {
pd.close();
}
} catch (Exception e){
e.printStackTrace();
}
}
}
I'm able to read all Invoice Nos. using a Java regex.
The result is as follows:
run:
Invoice No. D0000003010
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003013
Invoice No. D0000003013
Invoice No. D0000003014
Invoice No. D0000003014
Invoice No. D0000003015
Invoice No. D0000003016
I need to split the PDF according to those Invoice Nos. For example, all pages with Invoice No. D0000003011 should be merged into a single PDF, and so on.
How can I achieve this?

public static void main(String[] args) throws IOException, COSVisitorException
{
File input = new File("G:\\Sales.pdf");
PDDocument outputDocument = null;
PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
PDFTextStripper stripper = new PDFTextStripper();
String currentNo = null;
for (int page = 1; page <= inputDocument.getNumberOfPages(); ++page)
{
stripper.setStartPage(page);
stripper.setEndPage(page);
String text = stripper.getText(inputDocument);
Pattern p = Pattern.compile("Invoice No.(\\s\\w\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d)");
// Matcher refers to the actual text where the pattern will be found
Matcher m = p.matcher(text);
String no = null;
if (m.find())
{
no = m.group(1);
}
System.out.println("page: " + page + ", value: " + no);
PDPage pdPage = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);
if (no != null && !no.equals(currentNo))
{
saveCloseCurrent(currentNo, outputDocument);
// create new document
outputDocument = new PDDocument();
currentNo = no;
}
if (no == null && currentNo == null)
{
System.out.println ("header page ??? " + page + " skipped");
continue;
}
// append page to current document
outputDocument.importPage(pdPage);
}
saveCloseCurrent(currentNo, outputDocument);
inputDocument.close();
}
private static void saveCloseCurrent(String currentNo, PDDocument outputDocument)
throws IOException, COSVisitorException
{
// save to new output file
if (currentNo != null)
{
// save document into file
File f = new File(currentNo + ".pdf");
if (f.exists())
{
System.err.println("File " + f + " exists?!");
System.exit(-1);
}
outputDocument.save(f);
outputDocument.close();
}
}
Beware:
this has not been tested with your file (because I don't have it);
the code makes the assumption that identical invoice numbers are always together;
your regular expression has been changed slightly;
make sure that the first and the last PDF files are correct, and check a few at random, and with different viewers if available;
verify that the total count of files is as expected;
the summed up size of all files will be bigger than the source file, this is because of the font resources;
use the 1.8.10 version. Don't use PDFBox 0.7.3.jar at the same time!
error handling is very basic, you need to change it;
update 19.8.2015:
it now supports pages with no invoice number, these will be appended.
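The answer above assumes that pages with the same invoice number are adjacent. If that is not guaranteed, one option is to first build a map from invoice number to page numbers and only then create one output document per entry. The following is a minimal sketch of that grouping step in plain Java, with no PDFBox calls; the class name InvoicePageGrouper and the method groupPages are made up for illustration, and the input is assumed to be the per-page text you already get from PDFTextStripper. Unlike the answer's pattern, the capture group here leaves out the leading whitespace and the dot after "No" is escaped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class InvoicePageGrouper {
    // One word character followed by ten digits, as in "D0000003011".
    private static final Pattern INVOICE =
            Pattern.compile("Invoice No\\.\\s(\\w\\d{10})");

    /**
     * pageTexts.get(i) is the extracted text of page i + 1.
     * Returns invoice number -> 1-based page numbers, in order of first appearance.
     */
    static Map<String, List<Integer>> groupPages(List<String> pageTexts) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = INVOICE.matcher(pageTexts.get(i));
            if (m.find()) {
                current = m.group(1);
            }
            if (current == null) {
                continue; // leading pages without any invoice number are skipped
            }
            // Pages without a number are appended to the most recent invoice.
            groups.computeIfAbsent(current, k -> new ArrayList<>()).add(i + 1);
        }
        return groups;
    }
}
```

Each map entry can then drive one output document: create a new PDDocument, import the listed pages, and save it under the invoice number.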

Related

How to create a new PDF file if a file name already exists?

My code below outputs a simple receipt containing details from the user such as name, fare and stop number, and generates a PDF file with those details. Whenever a new user enters data in the main form, this overwrites the first user's data in the PDF file. How can I create a new PDF file each time, without appending to or overwriting the first user's file (like sample.pdf, sample2.pdf, sample3.pdf and so on)?
public class PDFDisplay {
public static void generatePDF(PassengerBean passengerBean) {
Document document = new Document();
try {
final Chunk NEWLINE = new Chunk("\n");
PdfWriter.getInstance(document,
new FileOutputStream("C://sample.pdf"));
document.open();
Image img = Image.getInstance("C:\\Documents and Settings\\Pinky\\My Documents\\Angel's files\\ICS 113\\eclipse_ws\\MRTApplicationIteration2\\WebContent\\image\\mrt.jpg");
document.add(img);
String or = "Official Receipt";
String hr = "-----------------------------------------------------------";
String spacer = " ";
String name = "Passenger Name: " + passengerBean.lname + "," + " " + passengerBean.fname;
String dest = "Destination: " + passengerBean.dest + " STATION";
String stopno = passengerBean.stop;
double fare = passengerBean.fare;
String fare1 = "Fare: PHP" + " " + String.valueOf(fare);
String ccnum = "CREDIT CARD NUMBER: " + " " + "************" + passengerBean.ccnum.substring(Math.max(0, passengerBean.ccnum.length() - 4));
Paragraph para10 = new Paragraph(32);
para10.setSpacingBefore(10);
para10.setSpacingAfter(10);
para10.add(new Chunk(or));
document.add(para10);
Paragraph para9 = new Paragraph(32);
para9.setSpacingBefore(30);
para9.setSpacingAfter(10);
para9.add(new Chunk(hr));
document.add(para9);
// Setting paragraph line spacing to 32
Paragraph para1 = new Paragraph(32);
para1.setSpacingBefore(5);
para1.setSpacingAfter(10);
para1.add(new Chunk(name));
document.add(para1);
Paragraph para2 = new Paragraph();
para2.setSpacingAfter(10);
para2.add(new Chunk(dest));
document.add(para2);
Paragraph para3 = new Paragraph();
para3.setSpacingAfter(10);
para3.add(new Chunk(stopno));
document.add(para3);
Paragraph para4 = new Paragraph();
para4.setSpacingAfter(10);
para4.add(new Chunk(fare1));
document.add(para4);
Paragraph para5 = new Paragraph();
para5.setSpacingAfter(10);
para5.add(new Chunk(ccnum));
document.add(para5);
document.close();
} catch (DocumentException e) {
e.printStackTrace();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Almost all the methods you might need to achieve what you want can be found in the Java API documentation for the File class
You want to create a unique file that starts with sample and ends with pdf. To achieve this, you can use the createTempFile() method. This question was already answered on StackOverflow 6 years ago: What is the best way to generate a unique and short file name in Java
Suppose that you really want to have incremental numbers in your file name, e.g. sample0001.pdf, sample0002.pdf, sample0003.pdf and so on, then you can use the list() method. This returns an array of String values with the names of all files in a directory. I suggest that you use a FilenameFilter so that you only get the PDF files starting with sample. You could then sort these names to find the name with the highest number. See How to list latest files in a directory using FileNameFilter to find out how to create such a filter.
Once you have the file name with the highest number, it's only a matter of String manipulation to create a new filename. Use that filename (or that File instance) when you define the OutputStream.
As you can see, this answer doesn't mention iText anywhere and although the extension of the files we create or list is .pdf, it has nothing to do with PDF or PDF generation either. It's a pure Java question.
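The list()-plus-FilenameFilter approach described above could look roughly like this. It is only a sketch: the class name NextFileName and the method nextFile are invented for illustration, the naming scheme (sample.pdf, sample2.pdf, sample3.pdf) is taken from the question, and the lookup is not atomic, so two processes running at the same time could still pick the same name.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built only on java.io.File.
class NextFileName {
    // Matches sample.pdf, sample2.pdf, sample3.pdf, ... (at most 9 digits).
    private static final Pattern NUMBERED = Pattern.compile("sample(\\d{0,9})\\.pdf");

    /** Returns the next unused sampleN.pdf file in the given directory. */
    static File nextFile(File dir) {
        int highest = 0;
        // The lambda is a FilenameFilter: keep only names that fit the scheme.
        String[] names = dir.list((d, name) -> NUMBERED.matcher(name).matches());
        if (names != null) { // list() returns null if dir is not a directory
            for (String name : names) {
                Matcher m = NUMBERED.matcher(name);
                if (m.matches()) {
                    String digits = m.group(1);
                    // A bare "sample.pdf" counts as number 1.
                    int n = digits.isEmpty() ? 1 : Integer.parseInt(digits);
                    highest = Math.max(highest, n);
                }
            }
        }
        int next = highest + 1;
        return new File(dir, next == 1 ? "sample.pdf" : "sample" + next + ".pdf");
    }
}
```

The returned File can be passed straight to new FileOutputStream(...) in place of the hard-coded "C://sample.pdf".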

Replacing a text in Apache POI XWPF not working

I'm currently trying to work on the code mentioned on a previous post called Replacing a text in Apache POI XWPF.
I have tried the below and it works but I don't know if I am missing anything. When I run the code the text is not replaced but added onto the end of what was searched. For example I have created a basic word document and entered the text "test". In the below code when I run it I eventually get the new document with the text "testDOG".
I have had to change the original code from String text = r.getText(0) to String text = r.toString() because I kept getting a NullError while running the code.
import java.io.*;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class testPOI {
public static void main(String[] args) throws Exception{
String filepath = "F:\\MASTER_DOC.docx";
String outpath = "F:\\Test.docx";
XWPFDocument doc = new XWPFDocument(OPCPackage.open(filepath));
for (XWPFParagraph p : doc.getParagraphs()){
for (XWPFRun r : p.getRuns()){
String text = r.toString();
if(text.contains("test")) {
text = text.replace("test", "DOG");
r.setText(text);
}
}
}
doc.write(new FileOutputStream(outpath));
}
}
EDIT: Thanks for your help everyone. I browsed around and found a solution on Replace table column value in Apache POI
This method replaces search strings in paragraphs and also handles strings that span more than one run.
private long replaceInParagraphs(Map<String, String> replacements, List<XWPFParagraph> xwpfParagraphs) {
long count = 0;
for (XWPFParagraph paragraph : xwpfParagraphs) {
List<XWPFRun> runs = paragraph.getRuns();
for (Map.Entry<String, String> replPair : replacements.entrySet()) {
String find = replPair.getKey();
String repl = replPair.getValue();
TextSegement found = paragraph.searchText(find, new PositionInParagraph());
if ( found != null ) {
count++;
if ( found.getBeginRun() == found.getEndRun() ) {
// whole search string is in one Run
XWPFRun run = runs.get(found.getBeginRun());
String runText = run.getText(run.getTextPosition());
String replaced = runText.replace(find, repl);
run.setText(replaced, 0);
} else {
// The search string spans over more than one Run
// Put the Strings together
StringBuilder b = new StringBuilder();
for (int runPos = found.getBeginRun(); runPos <= found.getEndRun(); runPos++) {
XWPFRun run = runs.get(runPos);
b.append(run.getText(run.getTextPosition()));
}
String connectedRuns = b.toString();
String replaced = connectedRuns.replace(find, repl);
// The first Run receives the replaced String of all connected Runs
XWPFRun partOne = runs.get(found.getBeginRun());
partOne.setText(replaced, 0);
// Removing the text in the other Runs.
for (int runPos = found.getBeginRun()+1; runPos <= found.getEndRun(); runPos++) {
XWPFRun partNext = runs.get(runPos);
partNext.setText("", 0);
}
}
}
}
}
return count;
}
Your logic is not quite right. You need to collate all the text in the runs first and then do the replace. You also need to remove all runs for the paragraph and add a new single run if a match on "test" is found.
Try this instead:
public class testPOI {
public static void main(String[] args) throws Exception{
String filepath = "F:\\MASTER_DOC.docx";
String outpath = "F:\\Test.docx";
XWPFDocument doc = new XWPFDocument(new FileInputStream(filepath));
for (XWPFParagraph p : doc.getParagraphs()){
int numberOfRuns = p.getRuns().size();
// Collate text of all runs
StringBuilder sb = new StringBuilder();
for (XWPFRun r : p.getRuns()){
int pos = r.getTextPosition();
if(r.getText(pos) != null) {
sb.append(r.getText(pos));
}
}
// Continue if there is text and contains "test"
if(sb.length() > 0 && sb.toString().contains("test")) {
// Remove all existing runs, last to first so the remaining indices stay valid
for(int i = numberOfRuns - 1; i >= 0; i--) {
p.removeRun(i);
}
String text = sb.toString().replace("test", "DOG");
// Add new run with updated text (createRun() already appends it to the paragraph)
XWPFRun run = p.createRun();
run.setText(text);
}
}
doc.write(new FileOutputStream(outpath));
}
}
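To see why the text has to be collated before replacing, it can help to look at the same logic without any POI types. The sketch below models a paragraph as a plain list of run texts; the class RunCollator and its method are invented for illustration and are not part of Apache POI. A search string that Word has split over several runs is only found once the runs are joined.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for a paragraph's runs; not POI code.
class RunCollator {
    /**
     * Joins the run texts and replaces in the joined string.
     * Returns a single run holding the replaced text if a match was found,
     * otherwise a copy of the original runs.
     */
    static List<String> replaceAcrossRuns(List<String> runs, String find, String repl) {
        StringBuilder sb = new StringBuilder();
        for (String r : runs) {
            if (r != null) { // a run may carry no text at all
                sb.append(r);
            }
        }
        String joined = sb.toString();
        if (!joined.contains(find)) {
            return new ArrayList<>(runs);
        }
        List<String> result = new ArrayList<>();
        result.add(joined.replace(find, repl));
        return result;
    }
}
```

Collapsing everything into one run is simple, but any per-run formatting (bold, colour, font) in the real document is lost, which is the trade-off of this approach.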
Worth noting: run.getTextPosition() returns -1 in most cases. That does no harm when a run has only one text element, but technically a run can hold any number of them, and I have seen such cases. So the safest way is to call getCTR() on the run and iterate over all of its text elements; their count equals ctrRun.sizeOfTArray().
Sample code:
for (XWPFRun run : p.getRuns()){
CTR ctrRun = run.getCTR();
int sizeOfCtr = ctrRun.sizeOfTArray();
// Visit every text element of the run, not just the first one
for(int textPosition=0; textPosition<sizeOfCtr; textPosition++){
String text = run.getText(textPosition);
if(text != null && text.contains("test")) {
text = text.replace("test", "DOG");
run.setText(text,textPosition);
}
}
}
Just change the text of every run in your paragraph, and then save the file.
This code worked for me:
XWPFDocument doc = new XWPFDocument(new FileInputStream(filepath));
for (XWPFParagraph p : doc.getParagraphs()) {
StringBuilder sb = new StringBuilder();
for (XWPFRun r : p.getRuns()) {
String text = r.getText(0);
if (text != null && text.contains("variable1")) {
text = text.replace("variable1", "valeur1");
r.setText(text, 0);
}
if (text != null && text.contains("variable2")) {
text = text.replace("variable2", "valeur2");
r.setText(text, 0);
}
if (text != null && text.contains("variable3")) {
text = text.replace("variable3", "valeur3");
r.setText(text, 0);
}
}
}
doc.write(new FileOutputStream(outpath));

How to remove headers and footer from pdf file using pdfbox in java

I am using a PDF parser to convert a PDF to text. Below is my Java code that converts the PDF to a text file.
My PDF file contains Following Data:
Data Sheet(Header)
PHP Courses for PHP Professionals(Header)
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
All copyrights reserved.(Footer).
Sample code:
public class PDF_TEST {
PDFParser parser;
String parsedText;
PDFTextStripper pdfStripper;
PDDocument pdDoc;
COSDocument cosDoc;
PDDocumentInformation pdDocInfo;
// PDFTextParser Constructor
public PDF_TEST() {
}
// Extract text from PDF Document
String pdftoText(String fileName) {
File f = new File(fileName);
if (!f.isFile()) {
return null;
}
try {
parser = new PDFParser(new FileInputStream(f));
} catch (Exception e) {
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
parsedText = pdfStripper.getText(pdDoc);
} catch (Exception e) {
e.printStackTrace();
try {
if (cosDoc != null) cosDoc.close();
if (pdDoc != null) pdDoc.close();
} catch (Exception e1) {
e1.printStackTrace();
}
return null;
}
return parsedText;
}
// Write the parsed text from PDF to a file
void writeTexttoFile(String pdfText, String fileName) {
try {
PrintWriter pw = new PrintWriter(fileName);
pw.print(pdfText);
pw.close();
} catch (Exception e) {
e.printStackTrace();
}
}
//Extracts text from a PDF Document and writes it to a text file
public static void test() {
String args[]={"C://Sample.pdf","C://Sample.txt"};
if (args.length != 2) {
System.exit(1);
}
PDF_TEST pdfTextParserObj = new PDF_TEST();
String pdfToText = pdfTextParserObj.pdftoText(args[0]);
if (pdfToText == null) {
}
else {
pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
}
}
public static void main(String args[]) throws IOException
{
test();
}
}
The above code works for extracting the PDF to text. But my requirement is to ignore the header and footer and extract only the content from the PDF file.
Required output:
Networking Academy
We live in an increasingly connected world, creating a global economy and a growing need for technical skills. Networking Academy delivers information technology skills to over 500,000 students a year in more than 165 countries worldwide. Networking Academy students have the opportunity to participate in a powerful and consistent learning experience that is supported by high quality, online curricula and assessments, instructor training, hands-on labs, and classroom interaction. This experience ensures the same level of qualifications and skills regardless of where in the world a student is located.
Please suggest me how to do this.
Thanks.
In general there is nothing special about header or footer texts in PDFs. It is possible to tag that material differently, but tagging is optional and the OP did not provide a sample PDF to check.
Thus, some manual work (or somewhat failure intensive image analysis) generally is necessary to find the regions on the pages for header, content, and footer material.
As soon as you have the coordinates for these regions, though, you can use the PDFTextStripperByArea which extends the PDFTextStripper to collect text by regions. Simply define a region for the page content using the largest rectangle including the content but excluding headers and footers, and after pdfStripper.getText(pdDoc) call getTextForRegion for the defined region.
You can use PDFTextStripperByArea to leave the header and footer out of the text extracted from a PDF file.
Code in Java using PDFBox:
public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
File file = new File(path + filename);
// try-with-resources closes the document even if extraction fails
try (PDDocument document = PDDocument.load(file)) {
//Rectangle2D region = new Rectangle2D.Double(x,y,width,height);
Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
String regionName = "region";
// getPage() takes a zero-based index; pageNumber is assumed to be 1-based here
PDPage page = document.getPage(pageNumber - 1);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
return stripper.getTextForRegion(regionName);
}
}
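The hard-coded rectangle (0, 100, 550, 700) in the code above only fits one particular layout. A small helper can derive the content region from the page size and the measured header and footer heights instead. This is a sketch under assumptions: the class and method names are invented, the header and footer heights have to be measured by hand for your documents, and it assumes (as far as I know this is how PDFTextStripperByArea interprets regions) that y is measured from the top of the page downwards.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper; only uses java.awt.geom, no PDFBox calls.
class ContentRegion {
    /**
     * Builds the largest full-width rectangle that excludes a header band at the top
     * and a footer band at the bottom. All values are in PDF points (1/72 inch).
     */
    static Rectangle2D contentRegion(double pageWidth, double pageHeight,
                                     double headerHeight, double footerHeight) {
        double height = pageHeight - headerHeight - footerHeight;
        if (height <= 0) {
            throw new IllegalArgumentException("header and footer leave no content area");
        }
        // x = 0 (left edge), y = headerHeight (just below the header band)
        return new Rectangle2D.Double(0, headerHeight, pageWidth, height);
    }
}
```

For an A4 page (595 x 842 points) with a 100 point header and a 42 point footer this yields a region starting at y = 100 with a height of 700, which can be passed to addRegion.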

How can I remove all images/drawings from a PDF file and leave text only in Java?

I have a PDF file that's an output from an OCR processor, this OCR processor recognizes the image, adds the text to the pdf but at the end places a low quality image instead of the original one (I have no idea why anyone would do that, but they do).
So, I would like to get this PDF, remove the image stream and leave the text alone, so that I could get it and import (using iText page importing feature) to a PDF I'm creating myself with the real image.
And before someone asks, I have already tried to use another tool to extract text coordinates (JPedal) but when I draw the text on my PDF it isn't at the same position as the original one.
I'd rather have this done in Java, but if another tool can do it better, just let me know. And it could be image removal only, I can live with a PDF with the drawings in there.
I used Apache PDFBox in a similar situation.
To be a little more specific, try something like this:
import org.apache.pdfbox.exceptions.COSVisitorException;
import org.apache.pdfbox.exceptions.CryptographyException;
import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import java.io.IOException;
public class Main {
public static void main(String[] argv) throws COSVisitorException, InvalidPasswordException, CryptographyException, IOException {
PDDocument document = PDDocument.load("input.pdf");
if (document.isEncrypted()) {
document.decrypt("");
}
PDDocumentCatalog catalog = document.getDocumentCatalog();
for (Object pageObj : catalog.getAllPages()) {
PDPage page = (PDPage) pageObj;
PDResources resources = page.findResources();
resources.getImages().clear();
}
document.save("strippedOfImages.pdf");
}
}
It's supposed to remove all types of images (PNG, JPEG, ...).
You need to parse the document as follows:
public static void strip(String pdfFile, String pdfFileOut) throws Exception {
PDDocument doc = PDDocument.load(pdfFile);
List pages = doc.getDocumentCatalog().getAllPages();
for( int i=0; i<pages.size(); i++ ) {
PDPage page = (PDPage)pages.get( i );
// added
COSDictionary newDictionary = new COSDictionary(page.getCOSDictionary());
PDFStreamParser parser = new PDFStreamParser(page.getContents());
parser.parse();
List tokens = parser.getTokens();
List newTokens = new ArrayList();
for(int j=0; j<tokens.size(); j++) {
Object token = tokens.get( j );
if( token instanceof PDFOperator ) {
PDFOperator op = (PDFOperator)token;
if( op.getOperation().equals( "Do") ) {
//remove the one argument to this operator
// added
COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
// added
deleteObject(newDictionary, name);
continue;
}
}
newTokens.add( token );
}
PDStream newContents = new PDStream( doc );
ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
writer.writeTokens( newTokens );
newContents.addCompression();
page.setContents( newContents );
// added
PDResources newResources = new PDResources(newDictionary);
page.setResources(newResources);
}
doc.save(pdfFileOut);
doc.close();
}
// added
public static boolean deleteObject(COSDictionary d, COSName name) {
for(COSName key : d.keySet()) {
if( name.equals(key) ) {
d.removeItem(key);
return true;
}
COSBase object = d.getDictionaryObject(key);
if(object instanceof COSDictionary) {
if( deleteObject((COSDictionary)object, name) ) {
return true;
}
}
}
return false;
}
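The core of the code above is the token filter: every "Do" operator is dropped together with the single operand that precedes it. The sketch below shows just that step on a simplified token list of strings; it is a stand-in for illustration (class and method names are invented) and does not use PDFBox's token types.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model: operands and operators are plain strings.
class DoOperatorFilter {
    /** Returns the tokens with every "Do" operator and its operand removed. */
    static List<String> stripDo(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if ("Do".equals(token)) {
                // The operand (the XObject name) was already copied; take it back out.
                if (!out.isEmpty()) {
                    out.remove(out.size() - 1);
                }
                continue; // and do not copy the operator itself
            }
            out.add(token);
        }
        return out;
    }
}
```

Keep in mind that "Do" paints any XObject, not only images. A form XObject can itself contain text, so removing every "Do" may also remove text that lives inside forms.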

iText: split a PDF into several PDF (1 per page)

What I want is that: given a 10-pages-pdf-file, I want to display each page of that pdf inside a table on the web. What is the best way to achieve this? I guess one way is to split this 10-pages-pdf-file into 10 1-pages pdf, and programmatically display each pdf onto a row of a table. Can I do this with iText? Is there a better way to accomplish this?
From Split a PDF file (using iText)
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.pdf.PdfCopy;
import com.lowagie.text.pdf.PdfImportedPage;
import com.lowagie.text.pdf.PdfReader;
public class SplitPDFFile {
/**
* @param args
*/
public static void main(String[] args) {
try {
String inFile = args[0].toLowerCase();
System.out.println ("Reading " + inFile);
PdfReader reader = new PdfReader(inFile);
int n = reader.getNumberOfPages();
System.out.println ("Number of pages : " + n);
int i = 0;
while ( i < n ) {
String outFile = inFile.substring(0, inFile.indexOf(".pdf"))
+ "-" + String.format("%03d", i + 1) + ".pdf";
System.out.println ("Writing " + outFile);
Document document = new Document(reader.getPageSizeWithRotation(1));
PdfCopy writer = new PdfCopy(document, new FileOutputStream(outFile));
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
document.close();
writer.close();
}
}
catch (Exception e) {
e.printStackTrace();
}
/* example :
java SplitPDFFile d:\temp\x\tx.pdf
Reading d:\temp\x\tx.pdf
Number of pages : 3
Writing d:\temp\x\tx-001.pdf
Writing d:\temp\x\tx-002.pdf
Writing d:\temp\x\tx-003.pdf
*/
}
}
Many iText examples here.
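One detail of the example above: it lower-cases the whole input path and looks for the first ".pdf" anywhere in it, which can produce a wrong path on case-sensitive file systems or when a directory name contains ".pdf". A possible way to build the output names without those pitfalls is sketched here; the class SplitNames and the method outputName are invented for illustration.

```java
import java.util.Locale;

// Hypothetical naming helper for the per-page output files.
class SplitNames {
    /** "dir/Tx.PDF", 1 -> "dir/Tx-001.pdf"; the original path casing is preserved. */
    static String outputName(String inFile, int pageNumber) {
        // Only a trailing ".pdf" (in any letter case) is treated as the extension.
        boolean hasPdfSuffix = inFile.toLowerCase(Locale.ROOT).endsWith(".pdf");
        int end = hasPdfSuffix ? inFile.length() - 4 : inFile.length();
        return inFile.substring(0, end) + "-" + String.format("%03d", pageNumber) + ".pdf";
    }
}
```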
With PDFBox's PDDocument you can do this very easily.
You just need a Java List of PDDocument and the Splitter class to split a document.
List<PDDocument> Pages=new ArrayList<PDDocument>();
try {
// Keep a reference to the loaded document so it can be passed to the splitter
PDDocument document = PDDocument.load(filePath);
Splitter splitter = new Splitter();
Pages = splitter.split(document);
}
catch(Exception e) {
e.printStackTrace(); // print the reason and the line number where the error occurred
}
I can't comment, but this line in the most voted answer
Document document = new Document(reader.getPageSizeWithRotation(1));
should be
Document document = new Document(reader.getPageSizeWithRotation(i+1));
to get the correct page size when the pages have different sizes (I know it's rare).
