Extracting data by table/header using PDFBox (Java)

I am trying to extract data from a PDF by header/table. I am not sure whether the data in the PDF counts as a header or a table. I checked whether the PDF has any metadata, but it comes back as null.
Here are the PDF examples:
I want to get the list of charges, and the amount for each one, from the Summary Of Charges header.
The summary of charges contains the amount and quantity for each charge.
This is my code for reading the metadata; the metadata variable comes out as null:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    String text = stripper.getText(document);
    PDDocumentInformation info = document.getDocumentInformation();
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    // getMetadata() returns null when the catalog has no XMP metadata stream
    PDMetadata metadata = catalog.getMetadata();
    InputStream stream = metadata.createInputStream();
}
document.close();
And this code just gives me unstructured chunks of text:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setSortByPosition(true);
    String text = stripper.getText(document);
    //InputStream stream = metadata.createInputStream();
    System.out.print("Text:" + text);
}
document.close();
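One possible way to get from the extracted text to structured rows is to look for the "Summary Of Charges" heading and match each following line against a pattern. The sketch below uses only the JDK and works on the string returned by stripper.getText(document). It is only an illustration: it assumes each charge line looks like "<description> <qty> <amount>" and that the section ends at a line starting with "Total". The names ChargeParser, Charge and parse are made up here, and the real invoice layout may need a different pattern.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChargeParser {

    /** One row of the summary: description, quantity, amount. */
    public static final class Charge {
        public final String description;
        public final int qty;
        public final BigDecimal amount;

        Charge(String description, int qty, BigDecimal amount) {
            this.description = description;
            this.qty = qty;
            this.amount = amount;
        }
    }

    // Assumed line layout: "<description> <qty> <amount>", e.g. "Monthly fee 2 1,250.00"
    private static final Pattern ROW =
            Pattern.compile("^(.+?)\\s+(\\d+)\\s+([\\d,]+\\.\\d{2})$");

    /** Collects the rows between the "Summary Of Charges" heading and the "Total" line. */
    public static List<Charge> parse(String text) {
        List<Charge> charges = new ArrayList<>();
        boolean inSection = false;
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!inSection) {
                // Skip everything until the heading is seen
                inSection = trimmed.equalsIgnoreCase("Summary Of Charges");
                continue;
            }
            if (trimmed.startsWith("Total")) {
                break; // assumed end of the section
            }
            Matcher m = ROW.matcher(trimmed);
            if (m.matches()) {
                charges.add(new Charge(
                        m.group(1),
                        Integer.parseInt(m.group(2)),
                        new BigDecimal(m.group(3).replace(",", ""))));
            }
            // Lines that do not match (e.g. a column header row) are ignored
        }
        return charges;
    }
}
```

Calling ChargeParser.parse(text) on the stripper output would then return one Charge per matching line. If the columns are not reliably separated by whitespace in the extracted text, a position-based approach would likely be needed instead.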

Related

Trying to merge documents with PDFMergerUtility in PDFBox 2.0.0

We have a PDF document of 10 pages. We need to rearrange the pages and split them into 3 or 4 documents. With PDFBox 1.8.x we used PDFMergerUtility and mergePdf.mergeDocuments(), and it worked fine. With PDFBox 2.0.0 the page sequence comes out wrong after rearranging and merging the 10 pages into 3 separate documents. I have tried both the temp-file and the main-memory MemoryUsageSetting options (setuptemp and setupmain); neither made a difference.
PDFBox 1.8 code sample:
PDDocument document = PDDocument.load(new File(sourceFile));
PDFMergerUtility PDFmerger = new PDFMergerUtility();
Splitter splitter = new Splitter();
splitter.setStartPage(fDStartPage);
splitter.setSplitAtPage((fDEndPage));
List<PDDocument> splittedDocuments = splitter.split(document);
PDFmerger.addSource(getInputStream(splittedDocuments.get(0)));
PDFmerger.setDestinationFileName(destinationFile);
PDFmerger.mergeDocuments();
PDFBox 2.0 code sample:
PDFMergerUtility pdfmerger = new PDFMergerUtility();
PDDocument document = PDDocument.load(new File(filename));
pdfmerger.setDestinationFileName(mergedFileName);
Splitter splitter = new Splitter();
splitter.setStartPage(9);
splitter.setSplitAtPage(10);
List<PDDocument> document1 = splitter.split(document);
InputStream is = null;
ByteArrayOutputStream out = new ByteArrayOutputStream();
document1.get(0).save(out);
byte[] data = out.toByteArray();
is = new ByteArrayInputStream(data);
pdfmerger.addSource(is);
pdfmerger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
document.close();

Printing Chinese characters in pdfbox

I'm using the following set-up:
Java 11.0.1
pdfbox 2.0.15
Objective: Rendering a pdf that contains Chinese characters
Problem: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding
I already tried:
Using different fonts for Chinese character support. The latest one is NotoSansCJKtc-Regular.ttf
Set font to unicode as described here: Java: Write national characters to PDF using PDFBox, however the used loadTTF method is deprecated.
Using Arial-Unicode-MS_4302.ttf
My code looks like this (shortened a bit):
try (InputStream pdfIn = inputStream;
     PDDocument pdfDocument = PDDocument.load(pdfIn)) {
    PDFont formFont;
    // Check if Chinese characters are present
    if (!Util.containsHanScript(queryString)) {
        formFont = PDType0Font.load(pdfDocument,
                PdfReportGenerator.class.getResourceAsStream("LiberationSans-Regular.ttf"),
                false);
    } else {
        formFont = PDType0Font.load(pdfDocument,
                PdfReportGenerator.class.getResourceAsStream("NotoSansCJKtc-Regular.ttf"),
                false);
    }
    List<PDField> fields = acroForm.getFields();
    // Load fields into Map
    Map<String, PDField> pdfFields = new HashMap<>();
    for (PDField field : fields) {
        String key = field.getPartialName();
        pdfFields.put(key, field);
    }
    PDField currentField = pdfFields.get("someFieldID");
    PDVariableText pdfield = (PDVariableText) currentField;
    PDResources res = acroForm.getDefaultResources();
    String fontName = res.add(formFont).getName();
    String defaultAppearanceString = "/" + fontName + " 10 Tf 0 g";
    pdfield.setDefaultAppearance(defaultAppearanceString);
    pdfield.setValue("李柱");
    acroForm.flatten(fields, true);
    ByteArrayOutputStream pdfOut = new ByteArrayOutputStream();
    pdfDocument.save(pdfOut);
}
Expected result: Chinese characters on pdf.
Actual result: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding
So my question is about how to best support rendering of Chinese characters with pdfbox. Any help is appreciated.
The following code works for me; it uses the file from PDFBOX-4629:
PDDocument doc = PDDocument.load(new URL("https://issues.apache.org/jira/secure/attachment/12977270/Report_Template_DE.pdf").openStream());
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDVariableText field = (PDVariableText) acroForm.getField("search_query");
List<PDField> fields = acroForm.getFields();
PDFont font = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arialuni.ttf"), false);
PDResources res = acroForm.getDefaultResources();
String fontName = res.add(font).getName();
String defaultAppearanceString = "/" + fontName + " 10 Tf 0 g";
field.setDefaultAppearance(defaultAppearanceString);
field.setValue("李柱");
acroForm.flatten(fields, true);
doc.save("saved.pdf");
doc.close();

Reading text in pdf file and split into multiple pdf files

I have a consolidated PDF file in which each page carries an ID and a page number in the form "Page X of Y". I need to split this one PDF into multiple PDF files based on the "Page X of Y" text. I am trying to do a proof of concept using iText, but I am struggling to read "Page X of Y" to identify the page numbers I need for splitting the file. Could someone shed some light on how to implement this in Java?
I tried the below code:
public static void main(String args[]) {
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    File file = new File("C:\\basics\\outbound\\FPPStmts.pdf");
    try {
        // PDFBox 2.0.8 require org.apache.pdfbox.io.RandomAccessRead
        RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
        PDFParser parser = new PDFParser(randomAccessFile);
        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(2);
        String parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
This produces blank text even though my PDF contains data.
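Separately from the blank-text problem, the splitting decision itself can be reduced to finding the pages marked "Page 1 of Y". The sketch below is a rough illustration of that step using only the JDK. It assumes the text of each page is already available as a string (for example from PDFTextStripper with setStartPage and setEndPage set to the same page) and that the marker is written exactly as "Page X of Y". The names PageGrouper and groupPages are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageGrouper {

    // Assumed marker format: "Page X of Y"
    private static final Pattern PAGE_MARKER =
            Pattern.compile("Page\\s+(\\d+)\\s+of\\s+(\\d+)");

    /**
     * Returns one int[]{firstIndex, lastIndex} (zero-based, inclusive) per sub-document.
     * A new sub-document starts on every page whose marker says "Page 1 of Y".
     * Pages without a marker stay with the current sub-document.
     */
    public static List<int[]> groupPages(List<String> pageTexts) {
        List<int[]> groups = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < pageTexts.size(); i++) {
            Matcher m = PAGE_MARKER.matcher(pageTexts.get(i));
            boolean startsNewDoc = m.find() && Integer.parseInt(m.group(1)) == 1;
            if (startsNewDoc && i > start) {
                // Close the previous sub-document just before this page
                groups.add(new int[] {start, i - 1});
                start = i;
            }
        }
        if (!pageTexts.isEmpty()) {
            // Close the last sub-document
            groups.add(new int[] {start, pageTexts.size() - 1});
        }
        return groups;
    }
}
```

Each returned range could then be copied into its own PDDocument, page by page. Note that the indices here are zero-based, whereas PDFTextStripper page numbers start at 1.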

Reading barcode from PDF using PDFBox

I have a PDF with barcodes, as shown in the image.
I am stuck on an issue and cannot proceed with my project. I am using PDFBox to parse the PDF and can convert the whole PDF to text, as shown in the code below:
public static PdfValues readPdf() throws IOException {
    System.out.println("Main Method Started");
    File file = new File("D:/po/temp/output.pdf");
    PDDocument document = PDDocument.load(file);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    String text = pdfStripper.getText(document);
    text = text.trim();
    text = text.replaceAll(" +", " ");
    text = text.replaceAll("(?m)^[ \t]*\r?\n", "");
    // System.out.println(text);
    deleteIfExist();
    writeToFile(text);
    PdfValues infos = readData();
    document.close();
    System.out.println("Main Method Ended");
    return infos;
}
But I am not getting the barcode value, which means it is not stored as text. Can anyone help me parse these barcode values, either as an image or as the actual value? Thank you for reading this question.

Reading a particular page from a PDF document using PDFBox

How do I read a particular page (given a page number) from a PDF document using PDFBox?
This should work:
PDPage firstPage = (PDPage)doc.getAllPages().get( 0 );
as seen in the BookMark section of the tutorial
Update 2015, Version 2.0.0 SNAPSHOT
Seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage page = document.getPage(0);
The getAllPages method has been renamed getPages:
PDPage page = document.getPages().get(0);
// Using the PDFBox library available from http://pdfbox.apache.org/
// Writes specific pages of a PDF document to a new PDF file
// Read in the PDF document
PDDocument pdDoc = PDDocument.load(file);
// Create a new PDF document
PDDocument document = null;
// Add page "i" (a zero-based index into the page list), then save the new PDF document
try {
    document = new PDDocument();
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
    document.save("file path" + "new document title" + ".pdf");
    document.close();
} catch (Exception e) {
    e.printStackTrace(); // do not swallow failures silently
}
Thought I would add my answer here as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually and look for a keyword; if that keyword appeared, I would then do something with that page (i.e. copy or ignore it).
I've tried to simplify my answer and replace common variables etc.:
public void extractImages() throws Exception {
    try {
        String destinationDir = "OUTPUT DIR GOES HERE";
        // Load the pdf
        String inputPdf = "INPUT PDF DIR GOES HERE";
        PDDocument document = PDDocument.load(inputPdf);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        // Declare output fileName
        String fileName = "output.pdf";
        // Create output file
        PDDocument newDocument = new PDDocument();
        // Create PDFTextStripper - used for searching the page string
        PDFTextStripper textStripper = new PDFTextStripper();
        // Loop through each page and search for "SEARCH STRING". If this doesn't exist
        // ie is the image page, then copy into the new output.pdf.
        // PDFTextStripper page numbers are 1-based; list indices are 0-based.
        for (int pageNumber = 1; pageNumber <= list.size(); pageNumber++) {
            // Set textStripper to search one page at a time
            textStripper.setStartPage(pageNumber);
            textStripper.setEndPage(pageNumber);
            // Fetch the text of this page only
            String pageText = textStripper.getText(document);
            boolean found = pageText.contains("SEARCH STRING");
            // if nothing is found, then copy the page across to new output pdf file
            if (!found) {
                PDPage returnPage = list.get(pageNumber - 1);
                System.out.println("page returned is: " + returnPage);
                System.out.println("Copy page");
                newDocument.importPage(returnPage);
            }
        }
        newDocument.save(destinationDir + fileName);
        System.out.println(fileName + " saved");
        // Close the copy first, then the source document it imported pages from
        newDocument.close();
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("catch extract image");
    }
}
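The page-by-page check described in this answer mixes two numbering schemes: PDFTextStripper's setStartPage and setEndPage take 1-based page numbers, while the page list is indexed from 0. A small sketch of the selection step on its own, using only the JDK, may make the mapping easier to see. It assumes the text of every page has already been extracted into a list, and the names PageFilter and pagesWithoutKeyword are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PageFilter {

    /**
     * Returns the 1-based page numbers whose text does NOT contain the keyword.
     * pageTexts.get(0) is the text of page 1, and so on.
     */
    public static List<Integer> pagesWithoutKeyword(List<String> pageTexts, String keyword) {
        List<Integer> keep = new ArrayList<>();
        for (int index = 0; index < pageTexts.size(); index++) {
            if (!pageTexts.get(index).contains(keyword)) {
                keep.add(index + 1); // list index is 0-based, page number is 1-based
            }
        }
        return keep;
    }
}
```

Keeping the decision separate from the PDF handling also makes it easy to check that the first and last pages are both considered.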
You can use the getPage method on a PDDocument instance:
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);
Here is the solution. Hope it will solve your issue.
String fileName = "C:\\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
// Pages 1 to 2 will be parsed. To parse a single page, set both values to the same number (e.g. setStartPage(1); setEndPage(1);)
String result = stripper.getText(doc);
doc.close();
Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.
