Reading text in pdf file and split into multiple pdf files

Reading text in pdf file and split into multiple pdf files - java

I have a consolidated pdf files which has text in each page with id and page numbers as "Page X of Y". I am in need of splitting one pdf file into multiple pdf files based on Page X of Y text. I am trying to do POC using iText but I am struggling to read Page X of Y to identify the page numbers which I need to use to split the file. May I get some light on implementing this using Java?
I tried the below code:
public static void main(String args[]) {
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File("C:\\basics\\outbound\\FPPStmts.pdf");
try {
// PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(randomAccessFile);
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(2);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
This is resulting me blank text though my pdf is having data.

Related

extracting data by table/header using pdfbox

I am trying to extract data from PDF by header/table. I am not sure if the PDF data is considered to be header or table. I tried to find if there are any metadata in the PDF, but it is null instead.
Here are the PDF examples:
I want to get the lists and amount of each list from the Summary Of Charges Header.
Based on the summary of charges, there are the amount and qty for each charges.
this is my code for the metadata which metadata variable comes out as null:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if(!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
PDDocumentInformation info = document.getDocumentInformation();
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDMetadata metadata = catalog.getMetadata();
InputStream stream = metadata.createInputStream();
}
document.close();`
and this code basically just give me chunks of texts.
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if(!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
//InputStream stream = metadata.createInputStream();
System.out.print("Text:" + text);
}
document.close();

Creating PDF file in java using PDDocument results in corrupted PDF files

I'm trying to create temporary PDF files in Java using PDDocument. I'm employing the following method to create a temporary PDF file.
/* Create a temporary PDF file.*/
private File createPdf(String fileName) throws IOException {
final PDDocument document = new PDDocument();
final File file = File.createTempFile(fileName, ".pdf");
//write it
BufferedWriter bw = new BufferedWriter(new FileWriter(file));
bw.write("This is the temporary pdf file content");
bw.close();
document.save(file);
document.close();
return file;
}
This is the test.
#Test
public void testCreateAndMergePdfs() throws IOException {
Collection<File> pdfs = new ArrayList<>(Arrays.asList(createPdf("File1"), createPdf("File2")));
assertFalse(CollectionUtils.isEmpty(pdfs));
PdfPrintPojo pdfPrintPojo = new PdfPrintPojo(pdfs);
File mergedFile = service.createAndMergePDFs(pdfPrintPojo, "Merged");
assertNotNull(mergedFile);
List<File> list = new ArrayList<>(pdfs);
File file1 = list.get(0);
File file2 = list.get(1);
assertTrue(FileUtils.contentEquals(file1, file2));
}
What I'm trying to do here is to create and merge two PDF files. When I run the test, it creates two PDF files in the temp folder, for example, \AppData\Local\Temp\File16375814641476797612.pdf and \AppData\Local\Temp\File24102718409195239661.pdf and the merged file at \AppData\Local\Temp\Merged_merged_3755858389884894769.pdf. But the test fails at
assertTrue(FileUtils.contentEquals(file1, file2));
When I try to open the PDF files in the temp folder, it says that the PDF is corrupted. Also, I have no idea why the files are not being saved as File1 and File2. Can anyone help me with this?

Using Apache PDFBox tutorial, I managed to create a working PDF file(s). The method was changed as follows.
/* Create a temporary PDF file.*/
private File createPdf(String fileName) throws IOException {
// Create a document and add a page to it
final PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont(font, 12);
contentStream.newLineAtOffset(100, 700);
contentStream.showText("Hello World");
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
File file = File.createTempFile(fileName, ".pdf");
document.save(file);
document.close();
return file;
}
As for the test, I took the approach of using PDDocument to load the files, then extract data as String using PDFTextStripper and using assertions on those Strings.
#Test
public void testCreateAndMergePdfs() throws IOException {
Collection<File> pdfs = new ArrayList<>(Arrays.asList(createPdf("File1"), createPdf("File2")));
assertFalse(CollectionUtils.isEmpty(pdfs));
PdfPrintPojo pdfPrintPojo = new PdfPrintPojo(pdfs);
File mergedFile = service.createAndMergePDFs(pdfPrintPojo, "Merged");
assertNotNull(mergedFile);
List<File> list = new ArrayList<>(pdfs);
/* Load the PDF files and extract data as String. */
PDDocument document1 = PDDocument.load(list.get(0));
PDDocument document2 = PDDocument.load(list.get(1));
PDDocument merged = PDDocument.load(mergedFile);
PDFTextStripper stripper = new PDFTextStripper();
String file1Data = stripper.getText(document1);
String file2Data = stripper.getText(document2);
String mergedData = stripper.getText(merged);
/* Assert that data from file 1 and 2 are equal with each other and merged file. */
assertEquals(file1Data, file2Data);
assertEquals(file1Data + file2Data, mergedData);
}

The way you compare the file contents is a bit different, Could you try with below,
#Test
public void testCreateAndMergePdfs() {
Assert.assertEquals(FileUtils.readLines(file1), FileUtils.readLines(file2));
}
Or you can try
byte[] file1Bytes = Files.readAllBytes(Paths.get("Path to File 1"));
byte[] file2Bytes = Files.readAllBytes(Paths.get("Path to File 2"));
String file1 = new String(file1Bytes, StandardCharsets.UTF_8);
String file2 = new String(file2Bytes, StandardCharsets.UTF_8);
assertEquals("The content in the strings should match", file1, file2);
Or
File file1 = new File(file1);
File file2 = new File(file2);
assertThat(file1).hasSameContentAs(file2);

Convert html source to pdf using iText

I have an html file, which is generated as a response by my site.
While opening the html file in my browser, all the images and other elements are displaying properly. Moreover, I want to convert this html file into a pdf file.
So am using iText API to generate the pdf file.
In the html file images and texts are located horizontally.
But in the generated pdf output the images are displaying vertically.
code
String k = convertString();
OutputStream file = new FileOutputStream(new File("f:\\Test.pdf"));
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, file);
document.open();
InputStream is = new ByteArrayInputStream(k.getBytes());
XMLWorkerHelper.getInstance().parseXHtml(writer, document, is);
document.close();
file.close();
private static String convertString() {
StringBuilder contentBuilder = new StringBuilder();
try {
BufferedReader in = new BufferedReader(new FileReader("f://testPage.html"));
String str;
while ((str = in.readLine()) != null) {
contentBuilder.append(str);
}
in.close();
} catch (IOException e) {
}
String content = contentBuilder.toString();
return content;
}
How can I generate a pdf that looks exactly the same as the appearance of the html UI?
Also, in Jfiddle, the html page is not displaying properly. What am I missing? Is the problem with my HTML, or is there some deeper issue with Jfiddle or something else at work?

How can we extract text content from PDF file without header and footer

How can we extract text content from PDF file, we are using pdfbox to extract text from PDF file but we are getting header and footer is not required. I am using following java code.
PDFTextStripper stripper = null;
try {
stripper = new PDFTextStripper();
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
stripper.setStartPage(pageCount);
stripper.setEndPage(pageCount);
try {
String pageText = stripper.getText(document);
System.out.println(pageText);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

You have tagged this as an itext/itextpdf question, yet you are using PdfBox. That's confusing.
You also claim that your PDF file has headers and footers. This would imply that your PDF is a Tagged PDF and that the header and the footer are marked as artifacts. If that is the case, than you should take advantage of the Tagged nature of the PDF, and extract the PDF as is done in the ParseTaggedPdf example:
TaggedPdfReaderTool readertool = new TaggedPdfReaderTool();
PdfReader reader = new PdfReader(StructuredContent.RESULT);
readertool.convertToXml(reader, new FileOutputStream(RESULT));
reader.close();
If this doesn't result in anything, you clearly don't have a Tagged PDF in which case there are no headers and footers in your document from a technical point of view. You may see headers and footers with your human eyes, but that doesn't mean that a machine sees these headers and footers. To a machine, it's just text like any other text in the page.
The ExtractPageContentArea example shows how we can define a rectangle that excludes the header and the footer when parsing for the content.
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();
In this case, we have examined the document manually and we noticed that the actual text is always added inside the rectangle new Rectangle(70, 80, 490, 580). The header is added above Y coordinate 580 and below coordinate 80. By using the RegionTextRenderFilter we can extract the content excluding the content that doesn't overlap with the rectangle we have defined.

Reading a particular page from a PDF document using PDFBox

How do I read a particular page (given a page number) from a PDF document using PDFBox?

This should work:
PDPage firstPage = (PDPage)doc.getAllPages().get( 0 );
as seen in the BookMark section of the tutorial
Update 2015, Version 2.0.0 SNAPSHOT
Seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage doc = document.getPage(0);
The getAllPages method has been renamed getPages
PDPage page = (PDPage)doc.getPages().get( 0 );

//Using PDFBox library available from http://pdfbox.apache.org/
//Writes pdf document of specific pages as a new pdf file
//Reads in pdf document
PDDocument pdDoc = PDDocument.load(file);
//Creates a new pdf document
PDDocument document = null;
//Adds specific page "i" where "i" is the page number and then saves the new pdf document
try {
document = new PDDocument();
document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
document.save("file path"+"new document title"+".pdf");
document.close();
}catch(Exception e){}

Thought I would add my answer here as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually, look for a keyword, if that keyword appeared, then do something with that page (ie copy or ignore it).
I've tried to simply and replace common variables etc in my answer:
public void extractImages() throws Exception {
try {
String destinationDir = "OUTPUT DIR GOES HERE";
// Load the pdf
String inputPdf = "INPUT PDF DIR GOES HERE";
document = PDDocument.load( inputPdf);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
// Declare output fileName
String fileName = "output.pdf";
// Create output file
PDDocument newDocument = new PDDocument();
// Create PDFTextStripper - used for searching the page string
PDFTextStripper textStripper=new PDFTextStripper();
// Declare "pages" and "found" variable
String pages= null;
boolean found = false;
// Loop through each page and search for "SEARCH STRING". If this doesn't exist
// ie is the image page, then copy into the new output.pdf.
for(int i = 0; i < list.size(); i++) {
// Set textStripper to search one page at a time
textStripper.setStartPage(i);
textStripper.setEndPage(i);
PDPage returnPage = null;
// Fetch page text and insert into "pages" string
pages = textStripper.getText(document);
found = pages.contains("SEARCH STRING");
if (i != 0) {
// if nothing is found, then copy the page across to new output pdf file
if (found == false) {
returnPage = list.get(i - 1);
System.out.println("page returned is: " + returnPage);
System.out.println("Copy page");
newDocument.importPage(returnPage);
}
}
}
newDocument.save(destinationDir + fileName);
System.out.println(fileName + " saved");
}
catch (Exception e) {
e.printStackTrace();
System.out.println("catch extract image");
}
}

you can you getPage method over PDDocument instance
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);

Here is the solution. Hope it will solve your issue.
string fileName="C:\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
//above page number 1 to 2 will be parsed. for parsing only one page set both value same (ex:setStartPage(1); setEndPage(1);)
string reslut = stripper.getText(doc);
doc.close();

Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Reading text in pdf file and split into multiple pdf files - java

Related

extracting data by table/header using pdfbox

Creating PDF file in java using PDDocument results in corrupted PDF files

Convert html source to pdf using iText

How can we extract text content from PDF file without header and footer

Reading a particular page from a PDF document using PDFBox

Categories

Resources