iText: split a PDF into several PDF (1 per page) - java

What I want is this: given a 10-page PDF file, I want to display each page of that PDF inside a table on the web. What is the best way to achieve this? I guess one way is to split the 10-page PDF into ten 1-page PDFs and programmatically display each PDF in a row of a table. Can I do this with iText? Is there a better way to accomplish this?

From Split a PDF file (using iText)
import java.io.FileOutputStream;
import com.lowagie.text.Document;
import com.lowagie.text.pdf.PdfCopy;
import com.lowagie.text.pdf.PdfImportedPage;
import com.lowagie.text.pdf.PdfReader;
public class SplitPDFFile {
/**
* @param args
*/
public static void main(String[] args) {
try {
String inFile = args[0].toLowerCase();
System.out.println ("Reading " + inFile);
PdfReader reader = new PdfReader(inFile);
int n = reader.getNumberOfPages();
System.out.println ("Number of pages : " + n);
int i = 0;
while ( i < n ) {
String outFile = inFile.substring(0, inFile.indexOf(".pdf"))
+ "-" + String.format("%03d", i + 1) + ".pdf";
System.out.println ("Writing " + outFile);
Document document = new Document(reader.getPageSizeWithRotation(1));
PdfCopy writer = new PdfCopy(document, new FileOutputStream(outFile));
document.open();
PdfImportedPage page = writer.getImportedPage(reader, ++i);
writer.addPage(page);
document.close();
writer.close();
}
}
catch (Exception e) {
e.printStackTrace();
}
/* example :
java SplitPDFFile d:\temp\x\tx.pdf
Reading d:\temp\x\tx.pdf
Number of pages : 3
Writing d:\temp\x\tx-001.pdf
Writing d:\temp\x\tx-002.pdf
Writing d:\temp\x\tx-003.pdf
*/
}
}
Many iText examples here.

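With Apache PDFBox's PDDocument you can do this very easily.
You just need a Java List of PDDocument and the Splitter class to split a document.
List<PDDocument> pages = new ArrayList<PDDocument>();
try {
// Keep a reference to the loaded document; the result of load() must not be discarded
PDDocument document = PDDocument.load(filePath);
Splitter splitter = new Splitter();
// With default settings, split() returns one PDDocument per page
pages = splitter.split(document);
}
catch (Exception e) {
e.printStackTrace(); // print the reason and the line number where the error occurred
}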

I can't comment, but this line in the most voted answer
Document document = new Document(reader.getPageSizeWithRotation(1));
should be
Document document = new Document(reader.getPageSizeWithRotation(i+1));
to get the correct PDF size if other pages have a different page size (I know it's rare).

Related

How to save output from XML to PDF

I am using JAXB to unmarshal XML. Then I want to take some of the information and write it to a PDF using iText. For some reason the PDF is created, but I can't open the file. I am also using ZFile, as this should work on mainframes too, but this shouldn't be a problem here.
Probably I am doing something wrong when writing to PDF file. Here is my code:
package music;
import java.io.*;
import java.sql.Timestamp;
import java.util.Date;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import com.ibm.jzos.ZFile;
import java.io.FileOutputStream;
import com.itextpdf.text.*;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfWriter;
import music.Music.Artist;
import music.Music.Artist.Album;
import music.Music.Artist.Album.Description;
import music.Music.Artist.Album.Song;
public class MusicXml {
public static void main(String[] args) throws JAXBException, IOException {
ZFile inputZ = null, outputZ = null;
File inputW = null;
PdfWriter outputW = null;
PdfContentByte cb = null;
Document pdf = new Document(PageSize.A4);
Paragraph paragraf = new Paragraph();
// Font
Font fnt12n;
JAXBContext jaxb = null;
Unmarshaller unmarsh = null;
String line = null, sep = " ";
Music music;
Date date = new Date();
Date startDate = new Timestamp(date.getTime());
System.out.println("Start: " + startDate);
jaxb = JAXBContext.newInstance(ObjectFactory.class);
unmarsh = jaxb.createUnmarshaller();
String os = System.getProperty("os.name");
System.out.println("System: " + os);
boolean isWin = os.toLowerCase().contains("wind");
if (!isWin) {
// z/OS:
inputZ = new ZFile(args[0], "rt"); // "rt" - readtext
InputStream inpStream = inputZ.getInputStream();
InputStreamReader streamRdr = new InputStreamReader(inpStream, "CP870");
try {
outputW = PdfWriter.getInstance(pdf, (new ZFile(args[1], "wb")).getOutputStream());
} catch (DocumentException e) {
e.printStackTrace();
}
music = (Music) unmarsh.unmarshal(streamRdr);
} else {
// Windows:
inputW = new File(args[0]);
music = (Music) unmarsh.unmarshal(inputW);
try {
outputW = PdfWriter.getInstance(pdf, new FileOutputStream(args[1]));
} catch (DocumentException e) {
e.printStackTrace();
}
}
List<Artist> listaArtystow = music.getArtist();
for (Artist artysta : listaArtystow) {
List<Album> listaAlbumow = artysta.getAlbum();
for (Album album : listaAlbumow) {
Description opis = album.getDescription();
List<Song> listaPiosenek = album.getSong();
for (Song piosenka : listaPiosenek) {
String artistName = artysta.getName();
String albumName = album.getTitle();
int numberOfSongs = listaPiosenek.size();
String albumDescription = album.getDescription().getValue();
String songTitle = piosenka.getTitle();
String songDuration = piosenka.getLength();
line = songTitle + sep + songDuration;
FontFactory.register(args[2], "jakiesFonty");
Font font = FontFactory.getFont("jakiesFonty", BaseFont.CP1250, BaseFont.EMBEDDED);
BaseFont bf = font.getBaseFont();
fnt12n = new Font(bf, 12f, Font.NORMAL, BaseColor.BLACK);
// PDF
outputW.setPdfVersion(PdfWriter.VERSION_1_7);
pdf.addTitle("Musical collection");
pdf.addAuthor("Natalia Nazaruk");
pdf.addSubject("Cwiczenie tworzenia PDF z XML");
pdf.addKeywords("Metadata, Java, iText, PDF");
pdf.addCreator("Program: MusicXML");
pdf.setMargins(60, 60, 50, 40);
pdf.open();
pdf.newPage();
try {
paragraf.setAlignment(Element.ALIGN_JUSTIFIED);
paragraf.setSpacingAfter(16f);
paragraf.setLeading(14f);
paragraf.setFirstLineIndent(30f);
paragraf.setFont(fnt12n);
pdf.add(new Paragraph(line, fnt12n));
} catch (DocumentException e) {
e.printStackTrace();
}
}
}
}
date = new Date();
Date stopDate = new Timestamp(date.getTime());
System.out.println("Stop: " + stopDate);
long diffInMs = stopDate.getTime() - startDate.getTime();
float diffInSec = diffInMs / 1000.00f;
System.out.format("Czas przetwarzenia pliku XML: %.2f s.", diffInSec);
System.exit(0);
if (isWin) {
outputW.close();
} else
outputZ.close();
}
}
Apart from the fact that you chose to use an old version of iText, there are a couple of other things wrong with your code. Which documentation did you read? I don't think you've already discovered the official iText web site, otherwise you would have used iText 7 instead of iText 5, and you would have known that no valid document is created if you never close the Document object.
The short answer is that you forgot:
pdf.close();
I see that you close the output stream:
if (isWin) {
outputW.close();
} else
outputZ.close();
}
That doesn't really make sense, because at that point, the PDF hasn't been finalized (for instance: no cross-reference table was created). When you close the document, the underlying output stream is closed implicitly (unless you tell iText explicitly not to do this).
There's also something awkward about the loops you are creating:
List<Artist> listaArtystow = music.getArtist();
for (Artist artysta : listaArtystow) {
...
for (Album album : listaAlbumow) {
...
for (Song piosenka : listaPiosenek) {
...
FontFactory.register(args[2], "jakiesFonty");
Font font = FontFactory.getFont("jakiesFonty", BaseFont.CP1250, BaseFont.EMBEDDED);
BaseFont bf = font.getBaseFont();
fnt12n = new Font(bf, 12f, Font.NORMAL, BaseColor.BLACK);
// PDF
outputW.setPdfVersion(PdfWriter.VERSION_1_7);
pdf.addTitle("Musical collection");
pdf.addAuthor("Natalia Nazaruk");
pdf.addSubject("Cwiczenie tworzenia PDF z XML");
pdf.addKeywords("Metadata, Java, iText, PDF");
pdf.addCreator("Program: MusicXML");
pdf.setMargins(60, 60, 50, 40);
pdf.open();
pdf.newPage();
...
}
}
}
output.close();
You create the same font over and over again. One PDF can only have 1 version (in your case PDF-1.7) and 1 set of metadata, yet you define that version and metadata over and over again. Finally, you open the document many times whereas you only need to open it once.
This makes more sense:
FontFactory.register(args[2], "jakiesFonty");
Font font = FontFactory.getFont("jakiesFonty", BaseFont.CP1250, BaseFont.EMBEDDED);
BaseFont bf = font.getBaseFont();
fnt12n = new Font(bf, 12f, Font.NORMAL, BaseColor.BLACK);
// PDF
outputW.setPdfVersion(PdfWriter.VERSION_1_7);
pdf.addTitle("Musical collection");
pdf.addAuthor("Natalia Nazaruk");
pdf.addSubject("Cwiczenie tworzenia PDF z XML");
pdf.addKeywords("Metadata, Java, iText, PDF");
pdf.addCreator("Program: MusicXML");
pdf.setMargins(60, 60, 50, 40);
pdf.open();
List<Artist> listaArtystow = music.getArtist();
for (Artist artysta : listaArtystow) {
...
for (Album album : listaAlbumow) {
...
for (Song piosenka : listaPiosenek) {
...
pdf.newPage();
...
}
}
}
pdf.close();
As you can see, you open() the Document instance pdf before the loop, to write the PDF headers, and you close() the Document after the loop to write some objects (e.g. fonts), the cross-reference table, and the PDF trailer. As you don't have pdf.close() in your code, all that necessary information is missing from your PDF.
Since you are new at iText, I would highly recommend you not to use versions older than iText 7. You may have discovered that the latest iText 5 release is iText 5.5.13, but that's a maintenance release. In maintenance releases, we only provide bug fixes for our paying customers; we don't add new functionality. For instance: the new PDF specification ISO 32000-2 (aka PDF 2.0) is only available from iText 7.1 on. We won't support PDF 2.0 in older versions.
If you go to the official web site, you'll notice that iText 7.1.1 is the most recent version (iText 7 download page). Where did you find iText, and how come you selected an old version? (This isn't a rhetorical question: we'd like to know to find out how we can improve our web site. We also want to know why so many people post such bad code on Stack Overflow; it's as if they can't find the tutorials. That's sad, because we're investing plenty of time and money in those tutorials. But if no one is reading them, what's the point?)
You can find more info about iText 7 in the Jump-Start tutorial and the Building Blocks tutorial.
As for converting XML to PDF, why don't you convert to HTML first, and then use the pdfHTML add-on? There's an example on how to do that in chapter 4 of the HTML to PDF tutorial as well as in the ZUGFeRD tutorial.

How to split pdf file by result in java pdfbox

I have one PDF file, which contains 60 pages. Each page carries an invoice number; some numbers are unique and some repeat across pages. I'm using Apache PDFBox.
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
import java.util.regex.*;
public class PDFtest1 {
public static void main(String[] args){
PDDocument pd;
try {
File input = new File("G:\\Sales.pdf");
// StringBuilder to store the extracted text
StringBuilder sb = new StringBuilder();
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
// Add text to the StringBuilder from the PDF
sb.append(stripper.getText(pd));
Pattern p = Pattern.compile("Invoice No.\\s\\w\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d");
// Matcher refers to the actual text where the pattern will be found
Matcher m = p.matcher(sb);
while (m.find()){
// group() returns the text matched by the most recent find()
System.out.println(m.group());
}
if (pd != null) {
pd.close();
}
} catch (Exception e){
e.printStackTrace();
}
}
}
I'm able to read all the invoice numbers using a Java regex.
The result is as follows:
run:
Invoice No. D0000003010
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003011
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003012
Invoice No. D0000003013
Invoice No. D0000003013
Invoice No. D0000003014
Invoice No. D0000003014
Invoice No. D0000003015
Invoice No. D0000003016
I need to split the PDF according to the invoice numbers. For example, all pages with Invoice No. D0000003011 should be merged into a single PDF, and so on.
How can I achieve this?
public static void main(String[] args) throws IOException, COSVisitorException
{
File input = new File("G:\\Sales.pdf");
PDDocument outputDocument = null;
PDDocument inputDocument = PDDocument.loadNonSeq(input, null);
PDFTextStripper stripper = new PDFTextStripper();
String currentNo = null;
for (int page = 1; page <= inputDocument.getNumberOfPages(); ++page)
{
stripper.setStartPage(page);
stripper.setEndPage(page);
String text = stripper.getText(inputDocument);
Pattern p = Pattern.compile("Invoice No.(\\s\\w\\d\\d\\d\\d\\d\\d\\d\\d\\d\\d)");
// Matcher refers to the actual text where the pattern will be found
Matcher m = p.matcher(text);
String no = null;
if (m.find())
{
no = m.group(1);
}
System.out.println("page: " + page + ", value: " + no);
PDPage pdPage = (PDPage) inputDocument.getDocumentCatalog().getAllPages().get(page - 1);
if (no != null && !no.equals(currentNo))
{
saveCloseCurrent(currentNo, outputDocument);
// create new document
outputDocument = new PDDocument();
currentNo = no;
}
if (no == null && currentNo == null)
{
System.out.println ("header page ??? " + page + " skipped");
continue;
}
// append page to current document
outputDocument.importPage(pdPage);
}
saveCloseCurrent(currentNo, outputDocument);
inputDocument.close();
}
private static void saveCloseCurrent(String currentNo, PDDocument outputDocument)
throws IOException, COSVisitorException
{
// save to new output file
if (currentNo != null)
{
// save document into file
File f = new File(currentNo + ".pdf");
if (f.exists())
{
System.err.println("File " + f + " exists?!");
System.exit(-1);
}
outputDocument.save(f);
outputDocument.close();
}
}
Beware:
this has not been tested with your file (because I don't have it);
the code makes the assumption that identical invoice numbers are always together;
your regular expression has been changed slightly;
make sure that the first and the last PDF files are correct, and check a few at random, and with different viewers if available;
verify that the total count of files is as expected;
the summed up size of all files will be bigger than the source file, this is because of the font resources;
use the 1.8.10 version. Don't use PDFBox 0.7.3.jar at the same time!
error handling is very basic, you need to change it;
update 19.8.2015:
it now supports pages with no invoice number, these will be appended.
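If identical invoice numbers are not guaranteed to sit on adjacent pages, one option is to collect the page numbers per invoice first and import the pages afterwards. The following is only a sketch under assumptions: the class InvoicePageGrouper and its group() method are hypothetical names, the page texts are assumed to have been extracted already (for instance one getText() call per page, as in the loop above), and the pattern captures the number without the leading whitespace. Pages without a number are attached to the most recent invoice; leading pages without one are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoicePageGrouper {
    // One word character followed by ten digits, as in the pattern above
    private static final Pattern INVOICE =
            Pattern.compile("Invoice No\\.\\s(\\w\\d{10})");

    // pageTexts.get(0) holds the text of page 1, and so on.
    // Returns invoice number -> 1-based page numbers, in order of first appearance.
    static Map<String, List<Integer>> group(List<String> pageTexts) {
        Map<String, List<Integer>> pagesByInvoice = new LinkedHashMap<>();
        String current = null;
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = INVOICE.matcher(pageTexts.get(i));
            if (m.find()) {
                current = m.group(1);
            }
            if (current == null) {
                continue; // leading page without an invoice number
            }
            pagesByInvoice.computeIfAbsent(current, k -> new ArrayList<>()).add(i + 1);
        }
        return pagesByInvoice;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList(
                "Invoice No. D0000003010 first page",
                "Invoice No. D0000003011 first page",
                "continuation page without a number",
                "Invoice No. D0000003010 late page");
        System.out.println(group(texts));
        // prints {D0000003010=[1, 4], D0000003011=[2, 3]}
    }
}
```

Each list of page numbers could then drive one output document, importing the listed pages in order.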

How to create a new PDF file if a file name already exists?

My code below outputs a simple receipt containing some details from the user, such as name, fare and stop number, and generates a PDF file with those details. Whenever a new user enters data in the main form, the PDF file of the first user is simply overwritten. How can I create a new PDF file each time, without appending to or overwriting the first user's data (e.g. sample.pdf, sample2.pdf, sample3.pdf, and so on)?
public class PDFDisplay {
public static void generatePDF(PassengerBean passengerBean) {
Document document = new Document();
try {
final Chunk NEWLINE = new Chunk("\n");
PdfWriter.getInstance(document,
new FileOutputStream("C://sample.pdf"));
document.open();
Image img = Image.getInstance("C:\\Documents and Settings\\Pinky\\My Documents\\Angel's files\\ICS 113\\eclipse_ws\\MRTApplicationIteration2\\WebContent\\image\\mrt.jpg");
document.add(img);
String or = "Official Receipt";
String hr = "-----------------------------------------------------------";
String spacer = " ";
String name = "Passenger Name: " + passengerBean.lname + "," + " " + passengerBean.fname;
String dest = "Destination: " + passengerBean.dest + " STATION";
String stopno = passengerBean.stop;
double fare = passengerBean.fare;
String fare1 = "Fare: PHP" + " " + String.valueOf(fare);
String ccnum = "CREDIT CARD NUMBER: " + " " + "************" + passengerBean.ccnum.substring(Math.max(0, passengerBean.ccnum.length() - 4));
Paragraph para10 = new Paragraph(32);
para10.setSpacingBefore(10);
para10.setSpacingAfter(10);
para10.add(new Chunk(or));
document.add(para10);
Paragraph para9 = new Paragraph(32);
para9.setSpacingBefore(30);
para9.setSpacingAfter(10);
para9.add(new Chunk(hr));
document.add(para9);
// Setting paragraph line spacing to 32
Paragraph para1 = new Paragraph(32);
para1.setSpacingBefore(5);
para1.setSpacingAfter(10);
para1.add(new Chunk(name));
document.add(para1);
Paragraph para2 = new Paragraph();
para2.setSpacingAfter(10);
para2.add(new Chunk(dest));
document.add(para2);
Paragraph para3 = new Paragraph();
para3.setSpacingAfter(10);
para3.add(new Chunk(stopno));
document.add(para3);
Paragraph para4 = new Paragraph();
para4.setSpacingAfter(10);
para4.add(new Chunk(fare1));
document.add(para4);
Paragraph para5 = new Paragraph();
para5.setSpacingAfter(10);
para5.add(new Chunk(ccnum));
document.add(para5);
document.close();
} catch (DocumentException e) {
e.printStackTrace();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Almost all the methods you might need to achieve what you want can be found in the Java API documentation for the File class.
You want to create a unique file that starts with sample and ends with pdf. To achieve this, you can use the createTempFile() method. This question was already answered on StackOverflow 6 years ago: What is the best way to generate a unique and short file name in Java
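A minimal sketch of the createTempFile() approach could look like this. The class name UniqueReceiptFile, the helper newReceiptFile() and the use of the system temp directory are assumptions made for illustration; they are not part of the question's code.

```java
import java.io.File;
import java.io.IOException;

public class UniqueReceiptFile {
    // Hypothetical helper: creates a new, empty file in dir whose name
    // starts with "sample", ends with ".pdf" and is unique within dir.
    static File newReceiptFile(File dir) throws IOException {
        return File.createTempFile("sample", ".pdf", dir);
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("java.io.tmpdir"));
        File first = newReceiptFile(dir);
        File second = newReceiptFile(dir);
        // The two names differ; the characters in the middle are generated
        System.out.println(first.getName());
        System.out.println(second.getName());
        // Pass the File to new FileOutputStream(file) when creating the PdfWriter
    }
}
```

Note that createTempFile() does not delete the file afterwards, so it also works for files that are meant to stay on disk.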
Suppose that you really want to have incremental numbers in your file name, e.g. sample0001.pdf, sample0002.pdf, sample0003.pdf and so on, then you can use the list() method. This returns an array of String values with the names of all files in a directory. I suggest that you use a FilenameFilter so that you only get the PDF files starting with sample. You could then sort these names to find the name with the highest number. See How to list latest files in a directory using FileNameFilter to find out how to create such a filter.
Once you have the file name with the highest number, it's only a matter of String manipulation to create a new filename. Use that filename (or that File instance) when you define the OutputStream.
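The list() and FilenameFilter steps described above might be sketched as follows. The class NextSampleFile, the method nextFile() and the four-digit naming scheme are assumptions for illustration. Instead of sorting the names as strings (which would place sample10.pdf before sample9.pdf), the sketch parses the numbers and keeps the highest one.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NextSampleFile {
    // Matches sample.pdf, sample2.pdf, sample0003.pdf, ...
    private static final Pattern NAME = Pattern.compile("sample(\\d{0,9})\\.pdf");

    // Returns a File for the next free number in dir; the file is not created here.
    static File nextFile(File dir) {
        FilenameFilter filter = new FilenameFilter() {
            @Override
            public boolean accept(File d, String name) {
                return NAME.matcher(name).matches();
            }
        };
        String[] names = dir.list(filter); // null if dir is not a readable directory
        int highest = 0;
        if (names != null) {
            for (String name : names) {
                Matcher m = NAME.matcher(name);
                if (m.matches()) {
                    // A bare "sample.pdf" is counted as number 1
                    int n = m.group(1).isEmpty() ? 1 : Integer.parseInt(m.group(1));
                    highest = Math.max(highest, n);
                }
            }
        }
        return new File(dir, String.format("sample%04d.pdf", highest + 1));
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(nextFile(dir).getPath());
    }
}
```

In a web application, two requests could compute the same name at the same time; this sketch does not guard against that, whereas createTempFile() does create the file atomically.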
As you can see, this answer doesn't mention iText anywhere and although the extension of the files we create or list is .pdf, it has nothing to do with PDF or PDF generation either. It's a pure Java question.

Read table from docx file using Apache POI

I am able to read tables from doc file. (see following code)
public String readDocFile(String filename, String str) {
try {
InputStream fis = new FileInputStream(filename);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
boolean intable = false;
boolean inrow = false;
for (int i = 0; i < range.numParagraphs(); i++) {
Paragraph par = range.getParagraph(i);
//System.out.println("paragraph "+(i+1));
//System.out.println("is in table: "+par.isInTable());
//System.out.println("is table row end: "+par.isTableRowEnd());
//System.out.println(par.text());
if (par.isInTable()) {
if (!intable) {//System.out.println("New table creating"+intable);
str += "<table border='1'>";
intable = true;
}
if (!inrow) {//System.out.println("New row creating"+inrow);
str += "<tr>";
inrow = true;
}
if (par.isTableRowEnd()) {
inrow = false;
} else {
//System.out.println("New text adding"+par.text());
str += "<td>" + par.text() + "</td>";
}
} else {
if (inrow) {//System.out.println("Closing Row");
str += "</tr>";
inrow = false;
}
if (intable) {//System.out.println("Closing Table");
str += "</table>";
intable = false;
}
str += par.text() + "<br/>";
}
}
} catch (Exception e) {
System.out.println("Exception: " + e);
}
return str;
}
Can anyone suggest how I can do the same with a docx file?
I tried to do that, but could not find a replacement for the 'Range' class.
Please help.
By popular request, promoting a comment to an answer...
In the Apache POI code examples, you can find the XWPF SimpleTable example
This shows how to create a simple table, and how to create one with lots of fancy styling.
Assuming you just want a simple table from scratch, in a brand new document, then the code you need goes along the lines of:
// Start with a new document
XWPFDocument doc = new XWPFDocument();
// Add a 3 column, 3 row table
XWPFTable table = doc.createTable(3, 3);
// Set some text in the middle
table.getRow(1).getCell(1).setText("EXAMPLE OF TABLE");
// table cells have a list of paragraphs; there is an initial
// paragraph created when the cell is created. If you create a
// paragraph in the document to put in the cell, it will also
// appear in the document following the table, which is probably
// not the desired result.
XWPFParagraph p1 = table.getRow(0).getCell(0).getParagraphs().get(0);
XWPFRun r1 = p1.createRun();
r1.setBold(true);
r1.setText("The quick brown fox");
r1.setItalic(true);
r1.setFontFamily("Courier");
r1.setUnderline(UnderlinePatterns.DOT_DOT_DASH);
r1.setTextPosition(100);
// And at the end
table.getRow(2).getCell(2).setText("only text");
// Save it out, to view in word
FileOutputStream out = new FileOutputStream("simpleTable.docx");
doc.write(out);
out.close();
The following snippet uses Apache POI 5.0.0, and it works well when reading docx table data
public void readDocxTables(String docxFilePath) throws FileNotFoundException, IOException {
XWPFDocument doc = new XWPFDocument(new FileInputStream(docxFilePath));
for(XWPFTable table : doc.getTables()) {
for(XWPFTableRow row : table.getRows()) {
for(XWPFTableCell cell : row.getTableCells()) {
System.out.println("cell text: " + cell.getText());
}
}
}
}
This is not Apache POI, but I found it much easier using a third-party component.
An example of how to get tables from a docx file.
Of course, this is just an idea in case you do not find a solution with POI.

Splitting one Pdf file to multiple according to the file-size

I have been trying to split one big PDF file into multiple PDF files based on size. I was able to split it, but only a single file is created and the rest of the data is lost; that is, no more than one output file is produced. Can anyone please help? Here is my code:
public static void main(String[] args) {
try {
PdfReader Split_PDF_By_Size = new PdfReader("C:\\Temp_Workspace\\TestZip\\input1.pdf");
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File1.pdf"));
document.open();
int number_of_pages = Split_PDF_By_Size.getNumberOfPages();
int pagenumber = 1; /* To generate file name dynamically */
// int Find_PDF_Size; /* To get PDF size in bytes */
float combinedsize = 0; /* To convert this to Kilobytes and estimate new PDF size */
for (int i = 1; i < number_of_pages; i++ ) {
float Find_PDF_Size;
if (combinedsize == 0 && i != 1) {
document = new Document();
pagenumber++;
String FileName = "File" + pagenumber + ".pdf";
copy = new PdfCopy(document, new FileOutputStream(FileName));
document.open();
}
copy.addPage(copy.getImportedPage(Split_PDF_By_Size, i));
Find_PDF_Size = copy.getCurrentDocumentSize();
combinedsize = (float)Find_PDF_Size / 1024;
if (combinedsize > 496 || i == number_of_pages) {
document.close();
combinedsize = 0;
}
}
System.out.println("PDF Split By Size Completed. Number of Documents Created:" + pagenumber);
}
catch (Exception i)
{
System.out.println(i);
}
}
}
(BTW, it would have been great if you had tagged your question with itext, too.)
PdfCopy used to close the PdfReaders it imported pages from whenever the source PdfReader for page imports switched or the PdfCopy was closed. This was due to the original intended use case to create one target PDF from multiple source PDFs in combination with the fact that many users forget to close their PdfReaders.
Thus, after you close the first target PdfCopy, the PdfReader is closed, too, and no further pages are extracted.
If I interpret the most recent checkins into the iText SVN repository correctly, this implicit closing of PdfReaders is in the process of being removed from the code. Therefore, with one of the next iText versions, your code may work as intended.
