I have a PDF with barcodes, as shown in the image.
I am stuck on an issue and cannot proceed with my project. I am using PDFBox to parse the PDF and can convert the whole PDF to text, as shown in the code below:
public static PdfValues readPdf() throws IOException {
System.out.println("Main Method Started");
File file = new File("D:/po/temp/output.pdf");
PDDocument document = PDDocument.load(file);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
text = text.trim();
text = text.replaceAll(" +", " ");
text = text.replaceAll("(?m)^[ \t]*\r?\n", "");
// System.out.println(text);
deleteIfExist();
writeToFile(text);
PdfValues infos = readData();
document.close();
System.out.println("Main Method Ended");
return infos;
}
But I am not getting the barcode value, which means it is not stored as text. Can anyone help me read these barcode values, either as an image or as the actual value? Thank you for reading this question.
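The clean-up step in readPdf does not depend on PDFBox, so it can be pulled out and checked on its own. The following is only a minimal sketch of that step; the names TextCleanup and normalize are made up for illustration and do not appear in the question. It affects the extracted text only and does nothing for the barcode itself.

```java
final class TextCleanup {

    /** Trims the text, collapses runs of spaces and drops blank lines (hypothetical helper). */
    static String normalize(String text) {
        String result = text.trim();
        // Collapse every run of spaces into a single space
        result = result.replaceAll(" +", " ");
        // (?m) lets ^ match at each line start, so empty or whitespace-only lines are removed
        result = result.replaceAll("(?m)^[ \\t]*\\r?\\n", "");
        return result;
    }

    public static void main(String[] args) {
        String raw = "  Order   123\n\n  \nTotal  42  ";
        System.out.println(normalize(raw));
    }
}
```

Keeping the regular expressions in one small method makes it easier to see what they remove before the text is written to the file.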
Related
I am trying to extract data from a PDF by header/table. I am not sure whether the PDF data counts as a header or a table. I tried to check whether the PDF has any metadata, but it is null.
Here are the PDF examples:
I want to get each list and its amount from the Summary Of Charges header.
The summary of charges contains the amount and quantity for each charge.
This is my code for the metadata; the metadata variable comes out as null:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if(!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
PDDocumentInformation info = document.getDocumentInformation();
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDMetadata metadata = catalog.getMetadata();
InputStream stream = metadata.createInputStream();
}
document.close();
This code basically just gives me chunks of text:
PDDocument document = PDDocument.load(new File("invoice.pdf"));
if(!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);
String text = stripper.getText(document);
//InputStream stream = metadata.createInputStream();
System.out.print("Text:" + text);
}
document.close();
My goal is to transfer textual content from a PDF to a new PDF while preserving the font formatting (e.g. bold, italic, underlined).
I am trying to use the TextPosition list from the existing PDF and write a new PDF from it.
For this I take the font and font size of the current entry from the TextPosition list and set them on a contentStream, so that the upcoming text is written through contentStream.showText().
After 137 successful loop iterations this error follows:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
super();
setSortByPosition(true);
}
@Override
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
}
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
setStartPage(pageNr);
setEndPage(pageNr);
getText(doc);
byPageList.add(List.copyOf(getCharactersByArticle()));
}
return byPageList;
}
}
Additional Info:
There are seven fonts in my document, all of which are embedded as subsets.
I need to write the given text with the corresponding given font.
All glyphs that should be written already exist in the original document, which is where I get my TextPosition list from.
All fonts have subtype Type1 or Type0.
There is no AcroForm defined.
Thanks in advance
Edit 30.08.2022:
Fixed the issue by manually replacing this particular Unicode character with a placeholder in the string before trying to write it.
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Does anyone have suggestions or workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that font with an alternative font from the source file, but this causes another problem: a COSStream is "randomly" closed, so the new document cannot be saved after I write my text with a contentStream.
Using a standard font like PDType1Font.HELVETICA instead works, though.
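The placeholder workaround from the 30.08.2022 edit could look roughly like the sketch below. It is plain Java and makes no PDFBox calls; the class name GlyphSafeText, the method name and the choice of "-" as a placeholder are assumptions for illustration. The idea is simply to remove U+00AD (the soft hyphen named in the exception) from each string before it is handed to showText().

```java
final class GlyphSafeText {

    /** U+00AD SOFT HYPHEN, the code point named in the exception message. */
    static final String SOFT_HYPHEN = "\u00AD";

    /** Replaces every soft hyphen with the given placeholder (hypothetical helper). */
    static String replaceSoftHyphens(String text, String placeholder) {
        return text.replace(SOFT_HYPHEN, placeholder);
    }

    public static void main(String[] args) {
        String fromPdf = "Trenn\u00ADung";
        // Pass "" instead of "-" to drop the soft hyphen entirely
        System.out.println(replaceSoftHyphens(fromPdf, "-"));
    }
}
```

This only handles the one code point from the stack trace; any other character missing from a subset font would need the same treatment.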
I am trying to convert a .txt file into a .pdf file using the iText library.
The problem that I am facing is the following:
I have clear formatting in the txt file, similar to this:
TEXT *******************
Other text here * SOME_CODE_HERE_ *
Other text *******************
And in the output the formatting is gone and looks like this:
TEXT ******************
Other text here * SOME_CODE_HERE_ *
Other text ******************
The code looks like this:
public static boolean convertTextToPDF(File file) throws Exception {
BufferedReader br = null;
try {
Document pdfDoc = new Document(PageSize.A4);
String output_file = file.getName().replace(".txt", ".pdf");
System.out.println("## writing to: " + output_file);
PdfWriter.getInstance(pdfDoc, new FileOutputStream(output_file)).setPdfVersion(PdfWriter.VERSION_1_7);
pdfDoc.open();
Font myfont = new Font();
myfont.setStyle(Font.NORMAL);
myfont.setSize(11);
pdfDoc.add(new Paragraph("\n"));
if (file.exists()) {
br = new BufferedReader(new FileReader(file));
String strLine;
while ((strLine = br.readLine()) != null) {
Paragraph para = new Paragraph(strLine + "\n", myfont);
para.setAlignment(Element.ALIGN_JUSTIFIED);
pdfDoc.add(para);
}
} else {
System.out.println("no such file exists!");
return false;
}
pdfDoc.close();
}
catch (Exception e) {
e.printStackTrace();
} finally {
if (br != null)
br.close();
}
return true;
}
I also tried to create a BaseFont with IDENTITY_H, but it doesn't work.
I guess it has to do with the encoding or something like that.
What do you think? I have run out of ideas...
Thanks
Later edit:
As suggested by Alan, and following the tutorial on iText's page, I added this to my existing code and it works fine.
BaseFont courier = BaseFont.createFont(BaseFont.COURIER, BaseFont.CP1252, BaseFont.EMBEDDED);
Font myfont = new Font(courier);
You need to use a monospaced font, e.g. Courier.
http://en.wikipedia.org/wiki/Monospaced_font
http://itextpdf.com/examples/iia.php?id=208
I know this is old, but I had the same problem converting text files into PDFs and I used this (written in VB.NET):
Dim pdfDoc As Document = New Document(PageSize.A4)
Dim pdfWrite As PdfWriter = PdfWriter.GetInstance(pdfDoc, New FileStream(pdfFoldername & "\" & "name of file", FileMode.Create))
pdfDoc.Open()
Dim courier As BaseFont = BaseFont.CreateFont(BaseFont.COURIER, BaseFont.CP1252, BaseFont.EMBEDDED)
Dim myfont As iTextSharp.text.Font = New iTextSharp.text.Font(courier, 10)
Dim para As Paragraph = New Paragraph(page, myfont)
pdfDoc.Add(para)
The difference from the above answer and updated code is that I use '10' as the font size. That made the PDF look identical to the formatting of the text file.
To get a good-looking result, you should use a PDF generation library that supports "Word-like" tabulation (left, right and center) at an absolute position in points or percent.
Here I used the PDFFlow library to move text to an exact position (for example, the text "(Name)" is placed 80% of the way from the left page edge):
DocumentBuilder
.New()
.AddSection()
.AddParagraph()
// Create a line from X=0 to X=20%:
.AddTabSymbol().AddTabulationInPercent(20, TabulationType.Right, TabulationLeading.BottomLine)
// Create a line from X=60% to X=100%:
.AddTabSymbol().AddTabSymbol().AddTabulationInPercent(60).AddTabulationInPercent(100, TabulationType.Right, TabulationLeading.BottomLine)
.ToSection()
.AddParagraph()
// Add text under the first line in the center (position 10%):
.AddTabSymbol().AddTextToParagraph("(Date)").AddTabulationInPercent(10, TabulationType.Center)
// Add text under the second line in the center (position 80%):
.AddTabSymbol().AddTextToParagraph("(Name)").AddTabulationInPercent(80, TabulationType.Center)
.ToDocument().Build("Result.pdf");
You can also set tabulation in points. However, if you need to change PaperSize and Orientation dynamically, it is better to use percent; the library then calculates the position in points automatically.
Here is a tutorial with more examples: Adding Tabulation to PDF.
I tried to convert a PDF page to an image, but my code only extracted the individual images embedded in the PDF page, not an image of the page itself.
Code below:
public class ExtractionPDFtoThumbImgs {
static String filePath = "/Users/tmdtjq/Downloads/PDFTest/test.pdf";
static String outputFilePath = "/Users/tmdtjq/Downloads/PDFTest/pageimages";
public static void change(File inputFile, File outputFolder) throws IOException {
//TODO check the input file exists and is PDF
//TODO for the treatment of PDF encrypted
PDDocument doc = null;
try {
doc = PDDocument.load(inputFile);
List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i <allPages.size(); i++) {
PDPage page = allPages.get(i);
// Render the page once; the duplicate convertToImage() call only discarded its result
BufferedImage image = page.convertToImage();
ImageIO.write(image, "jpg", new File(outputFolder.getAbsolutePath() + File.separator + (i + 1) + ".jpg"));
}
} finally {
if (doc != null) {
doc.close();
}
}
}
public static void main(String[] args) {
File inputFile = new File(ExtractionPDFtoThumbImgs.filePath);
File outputFolder = new File(ExtractionPDFtoThumbImgs.outputFilePath);
if(!outputFolder.exists()){
outputFolder.mkdirs();
}
try {
ExtractionPDFtoThumbImgs.change(inputFile, outputFolder);
} catch (IOException e) {
e.printStackTrace();
}
}
}
The code above extracts the images embedded in the PDF page; it does not convert the whole PDF page (including text) to an image.
Are there conversion tools (PDF page to image) or a PDFBox class that does the conversion?
Please suggest how to get an image of a PDF page (including text), not the images inside a PDF page.
Try pdftocairo; it's part of Poppler.
I was using ImageMagick to convert PDFs to images, but it relies on Ghostscript, which is sometimes picky about the PDF you feed it, so it was hit or miss...
So far pdftocairo has been solid.
http://poppler.freedesktop.org
How do I read a particular page (given a page number) from a PDF document using PDFBox?
This should work:
PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
as seen in the BookMark section of the tutorial
Update 2015, Version 2.0.0 SNAPSHOT
Seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage page = document.getPage(0);
The getAllPages method has been renamed to getPages:
PDPage page = doc.getPages().get(0);
// Using PDFBox library available from http://pdfbox.apache.org/
// Writes one specific page of a PDF document to a new PDF file
// Reads in the source PDF document
PDDocument pdDoc = PDDocument.load(file);
// Creates a new PDF document
PDDocument document = new PDDocument();
try {
    // Adds page "i", where "i" is the 0-based page index, and saves the new PDF document
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
    document.save("file path" + "new document title" + ".pdf");
} catch (Exception e) {
    // Report the failure instead of silently swallowing it
    e.printStackTrace();
} finally {
    // close() can throw IOException, so the enclosing method has to declare or handle it
    document.close();
    pdDoc.close();
}
I thought I would add my answer here, as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually and look for a keyword; if that keyword appeared, I would then do something with that page (i.e. copy or ignore it).
I've tried to simplify and replace common variables etc. in my answer:
public void extractImages() throws Exception {
    PDDocument document = null;
    PDDocument newDocument = null;
    try {
        String destinationDir = "OUTPUT DIR GOES HERE";
        // Load the pdf
        String inputPdf = "INPUT PDF DIR GOES HERE";
        document = PDDocument.load(inputPdf);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        // Declare output fileName
        String fileName = "output.pdf";
        // Create output file
        newDocument = new PDDocument();
        // Create PDFTextStripper - used for searching the page string
        PDFTextStripper textStripper = new PDFTextStripper();
        // Loop through each page and search for "SEARCH STRING". If this doesn't exist,
        // i.e. it is the image page, then copy it into the new output.pdf.
        // PDFTextStripper page numbers are 1-based, list indices are 0-based.
        for (int pageNr = 1; pageNr <= list.size(); pageNr++) {
            // Set textStripper to search one page at a time
            textStripper.setStartPage(pageNr);
            textStripper.setEndPage(pageNr);
            // Fetch the text of this page
            String pageText = textStripper.getText(document);
            boolean found = pageText.contains("SEARCH STRING");
            // If nothing is found, copy the page across to the new output pdf file
            if (!found) {
                PDPage returnPage = list.get(pageNr - 1);
                System.out.println("page returned is: " + returnPage);
                System.out.println("Copy page");
                newDocument.importPage(returnPage);
            }
        }
        newDocument.save(destinationDir + fileName);
        System.out.println(fileName + " saved");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("catch extract image");
    } finally {
        // Close both documents even if something went wrong
        if (newDocument != null) {
            newDocument.close();
        }
        if (document != null) {
            document.close();
        }
    }
}
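The page-selection logic in this answer can be checked separately from PDFBox. The sketch below assumes the text of every page has already been extracted into a list, one entry per page; the class PageSelection and its method are hypothetical names, not PDFBox API. It returns 1-based page numbers, matching what setStartPage and setEndPage expect, while the list itself is indexed from 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PageSelection {

    /** Returns the 1-based numbers of all pages whose text does not contain the keyword. */
    static List<Integer> pagesWithoutKeyword(List<String> pageTexts, String keyword) {
        List<Integer> pageNumbers = new ArrayList<>();
        for (int index = 0; index < pageTexts.size(); index++) {
            if (!pageTexts.get(index).contains(keyword)) {
                // List index 0 corresponds to page number 1
                pageNumbers.add(index + 1);
            }
        }
        return pageNumbers;
    }

    public static void main(String[] args) {
        List<String> pageTexts = Arrays.asList("cover", "has SEARCH STRING here", "image page");
        System.out.println(pagesWithoutKeyword(pageTexts, "SEARCH STRING"));
    }
}
```

Keeping the index-to-page-number conversion in one place makes it harder to skip the first or last page by accident.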
You can use the getPage method on a PDDocument instance:
PDDocument pdDocument=null;
pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);
Here is the solution. I hope it solves your issue.
String fileName = "C:\\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
// Pages 1 to 2 will be parsed. To parse only one page, set both values to the
// same number, e.g. setStartPage(1); setEndPage(1);
stripper.setStartPage(1);
stripper.setEndPage(2);
String result = stripper.getText(doc);
doc.close();
Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.