How to extract font styles of text contents using pdfbox?

I am using the PDFBox library to extract text content from a PDF file. I am able to extract all the text, but couldn't find a method to extract the font styles.

That is not the right way to extract fonts. To read the fonts, iterate over the PDF pages and get the fonts from each page's resources (PDFBox 1.8 API):
PDDocument doc = PDDocument.load("C:/mydoc3.pdf");
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages) {
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
}

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
public class pdf2box {
    public static void main(String args[])
    {
        try
        {
            PDDocument pddDocument = PDDocument.load("table2.pdf");
            PDFTextStripper textStripper = new PDFTextStripper();
            System.out.println(textStripper.getText(pddDocument));
            textStripper.getFonts();
            pddDocument.close();
        }
        catch (Exception ex)
        {
            ex.printStackTrace();
        }
    }
}

// PDFBox 2.x API
File file = new File("sample.pdf");
try (PDDocument document = PDDocument.load(file))
{
    for (int i = 0; i < document.getNumberOfPages(); ++i)
    {
        PDPage page = document.getPage(i);
        PDResources res = page.getResources();
        for (COSName fontName : res.getFontNames())
        {
            PDFont font = res.getFont(fontName);
            System.out.println(font.getName());
        }
    }
}
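
Text in a PDF carries no per-character "bold" or "italic" flag, so the style usually has to be inferred from the font, and the font name printed above is often the easiest hint. The following is a minimal heuristic sketch in plain Java. It assumes the names follow the common convention of appending the style to the family name, as in VVHOEY+FrutigerLT-BoldCn, where the six letters before the "+" mark an embedded subset. The class and method names (FontStyleGuess, stripSubsetPrefix, isBold, isItalic) are made up for illustration and are not part of PDFBox.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: guesses a style from a PDF font name.
final class FontStyleGuess {

    // Removes a subset tag such as "VVHOEY+" (six uppercase letters and a plus sign).
    static String stripSubsetPrefix(String fontName) {
        if (fontName != null && fontName.matches("[A-Z]{6}\\+.+")) {
            return fontName.substring(7);
        }
        return fontName;
    }

    // Heuristic: the name mentions a heavy weight.
    static boolean isBold(String fontName) {
        return containsAny(fontName, "bold", "black", "heavy");
    }

    // Heuristic: the name mentions a slanted style.
    static boolean isItalic(String fontName) {
        return containsAny(fontName, "italic", "oblique");
    }

    private static boolean containsAny(String fontName, String... needles) {
        if (fontName == null) {
            return false;
        }
        String lower = stripSubsetPrefix(fontName).toLowerCase(Locale.ROOT);
        for (String needle : needles) {
            if (lower.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String name = "VVHOEY+FrutigerLT-BoldCn";
        // prints "FrutigerLT-BoldCn bold=true italic=false"
        System.out.println(stripSubsetPrefix(name) + " bold=" + isBold(name) + " italic=" + isItalic(name));
    }
}
```

The result is only a guess: a font can be named arbitrarily, so the font descriptor may carry more reliable hints when it is present.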

Related

No glyph found after getting Text and Font from existing pdf

My goal is to transfer textual content from a PDF to a new PDF while preserving the font formatting (e.g. bold, italic, underlined).
I am trying to use the TextPosition list from the existing PDF and write a new PDF from it.
For this, I get the font and font size of the current entry from the TextPosition list and set them on a contentStream, then write the text with contentStream.showText().
After 137 successful loop iterations, this error occurs:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
    TextPosition tpInfo = null;
    String pdfFileInText = null;
    int charIDindex = 0;
    int pageIndex = 0;
    try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
        if (!pdfDocument.isEncrypted()) {
            MyPdfTextStripper myStripper = new MyPdfTextStripper();
            var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
            createDirectory();
            String newFileString = (srcErledigt + "Test.pdf");
            File input = new File(newFileString);
            input.createNewFile();
            PDDocument document = new PDDocument();
            // For Pages
            for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
                List<List<TextPosition>> pageList = pageIterator.next();
                PDPage newPage = new PDPage();
                document.addPage(newPage);
                PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
                contentStream.beginText();
                pageIndex++;
                // For Articles
                for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
                    List<TextPosition> articleList = articleIterator.next();
                    // For Text
                    for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
                        tpCharID = charIDindex;
                        tpInfo = tpIterator.next();
                        System.out.println(tpCharID + ". charID: " + tpInfo);
                        PDFont tpFont = tpInfo.getFont();
                        float tpFontSize = tpInfo.getFontSize();
                        pdfFileInText = tpInfo.toString();
                        contentStream.setFont(tpFont, tpFontSize);
                        contentStream.newLineAtOffset(50, 700);
                        contentStream.showText(pdfFileInText);
                        charIDindex++;
                    }
                }
                contentStream.endText();
                contentStream.close();
            }
        } else {
            System.out.println("pdf Encrypted");
        }
    }
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
    public MyPdfTextStripper() throws IOException {
        super();
        setSortByPosition(true);
    }
    @Override
    public List<List<TextPosition>> getCharactersByArticle() {
        return super.getCharactersByArticle();
    }
    // Add Pages to CharactersByArticle List
    public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
        final int maxPageNr = doc.getNumberOfPages();
        List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
        for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
            setStartPage(pageNr);
            setEndPage(pageNr);
            getText(doc);
            byPageList.add(List.copyOf(getCharactersByArticle()));
        }
        return byPageList;
    }
}
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the given text with its corresponding font.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are subtype 1 or 0
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
I fixed the issue by manually replacing this particular Unicode character (U+00AD) with a placeholder in the string before trying to write it.
Now I have run into this open TODO:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
    // todo: we can use a known character collection CMap for a CIDFont
    //       and an Encoding for Type 1-equivalent
    throw new UnsupportedOperationException();
}
Does anyone have suggestions or workarounds for this?
Edit 01.09.2022:
I tried to replace occurrences of that font with an alternative font from the source file, but this opens another problem: a COSStream is "randomly" closed, so the new document cannot be saved after I write my text with a contentStream.
Using standard fonts like PDType1Font.HELVETICA instead works, though.
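
The workaround from the 30.08.2022 edit, replacing the character that could not be encoded before calling showText, might be sketched as below. U+00AD is the soft hyphen, an invisible hyphenation hint. The class and method names are hypothetical and not part of PDFBox, and the choice of replacement (dropping the character or writing a visible "-") is an assumption about how the output should look.

```java
// Hypothetical helper, not part of PDFBox.
final class SoftHyphenFilter {

    static final char SOFT_HYPHEN = '\u00AD';

    // Replaces every soft hyphen with the given placeholder ("" drops it).
    static String replaceSoftHyphen(String text, String placeholder) {
        if (text == null) {
            return null;
        }
        return text.replace(String.valueOf(SOFT_HYPHEN), placeholder);
    }

    public static void main(String[] args) {
        String raw = "Ver\u00ADsi\u00ADche\u00ADrung";
        System.out.println(replaceSoftHyphen(raw, ""));  // prints "Versicherung"
        System.out.println(replaceSoftHyphen(raw, "-")); // prints "Ver-si-che-rung"
    }
}
```

In the loop from the question, such a call would sit between reading pdfFileInText and passing it to contentStream.showText(...). It only sidesteps this one character; any other character the subset font cannot encode would still throw.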

PdfBox - change font or fontName in pdf file?

I have PDF files with the font HPDFAA+Arial-BoldMTBold. This font name is incorrect, and the font is a subset.
I can change fonts with the Aspose.PDF.dll library (https://docs.aspose.com/pdf/net/replace-text-in-pdf/, section "Replace fonts in existing PDF file"), but I only have the trial version of that library.
How can I do this with PDFBox? I want to replace this font with Arial-BoldMT, or rename the font.
UPD: My attempts have led nowhere. In PDFontDescriptor I can rename the font, but how can I apply that to the PDFont? Or am I going the wrong way?
PDDocument pdfDocument = PDDocument.load(new File("Sample.pdf"));
PDPageTree pages = pdfDocument.getDocumentCatalog().getPages();
for (PDPage page : pages) {
    PDResources res = page.getResources();
    for (COSName fontName : res.getFontNames()) {
        PDFont font = res.getFont(fontName);
        PDFontDescriptor fontDescriptor = font.getFontDescriptor();
        System.out.println("fontDes: " + fontDescriptor.getFontName());
        String oldFontName = fontDescriptor.getFontName();
        String newFontName = oldFontName.replace("Arial-BoldMTBold", "Arial-BoldMT");
        fontDescriptor.setFontName(newFontName);
        System.out.println("font: " + font.getName());
    }
}

Here's code that is tailored to your file. It will only help you if this is about many similar files.
try (PDDocument doc = PDDocument.load(new File(XXX, "outerBox.pdf")))
{
    PDPage page = doc.getPage(0);
    for (COSName name : page.getResources().getFontNames())
    {
        PDFont font = page.getResources().getFont(name);
        String fontName = font.getName();
        if (font instanceof PDType0Font && fontName.endsWith("BoldMTBold"))
        {
            PDType0Font type0font = (PDType0Font) font;
            String newFontName = fontName.substring(0, fontName.length() - 4);
            type0font.getCOSObject().setString(COSName.BASE_FONT, newFontName);
            PDCIDFont descendantFont = type0font.getDescendantFont();
            descendantFont.getCOSObject().setString(COSName.BASE_FONT, newFontName);
            PDFontDescriptor fontDescriptor = descendantFont.getFontDescriptor();
            fontDescriptor.setFontName(newFontName);
        }
    }
    doc.save(new File(XXX, "outerBox-saved.pdf"));
}
[Screenshot: PDF structure, seen with PDFDebugger]

How to get an image of a PDF page (including text), not the images in a PDF page

I tried to convert a PDF page to an image, but my code only extracted the individual images in the PDF page, not an image of the whole page.
The code is below:
public class ExtractionPDFtoThumbImgs {
    static String filePath = "/Users/tmdtjq/Downloads/PDFTest/test.pdf";
    static String outputFilePath = "/Users/tmdtjq/Downloads/PDFTest/pageimages";
    public static void change(File inputFile, File outputFolder) throws IOException {
        //TODO check the input file exists and is PDF
        //TODO for the treatment of PDF encrypted
        PDDocument doc = null;
        try {
            doc = PDDocument.load(inputFile);
            List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = allPages.get(i);
                page.convertToImage();
                BufferedImage image = page.convertToImage();
                ImageIO.write(image, "jpg", new File(outputFolder.getAbsolutePath() + File.separator + (i + 1) + ".jpg"));
            }
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
    }
    public static void main(String[] args) {
        File inputFile = new File(ExtractionPDFtoThumbImgs.filePath);
        File outputFolder = new File(ExtractionPDFtoThumbImgs.outputFilePath);
        if (!outputFolder.exists()) {
            outputFolder.mkdirs();
        }
        try {
            ExtractionPDFtoThumbImgs.change(inputFile, outputFolder);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The above code extracts the images in the PDF page; it does not convert the PDF page (including text) to an image.
Are there conversion tools (PDF page to image), or a PDFBox class for the conversion?
Please suggest how to get an image of a PDF page (including text), not the images in a PDF page.

Try pdftocairo; it's part of Poppler.
I was using ImageMagick to convert PDFs to images, but it relies on Ghostscript, which is sometimes picky about the PDF you feed it, so it was hit or miss...
So far pdftocairo has been solid.
http://poppler.freedesktop.org

Make Tess4J get image from PDF file

How do I make Tess4J get an image from a PDF file?
I've started converting image files to text using OCR (Tess4J). It works fine; I have tested it on an image and it is great.
File imageFile = new File("D:\\HEAD2.png");
Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}
But I'm facing this problem: I would like to parse a PDF file that contains an image. I don't know how to do that, and I have not found any Tess4J example with PDF.
I tested this example with Asprise, but I can't find any example like this for Tess4J:
import com.asprise.util.pdf.PDFReader;
import com.asprise.util.ocr.OCR;
PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();
for (int i = 0; i < pages; i++) {
    BufferedImage img = reader.getPageAsImage(i);
    // recognizes both characters and barcodes
    String text = new OCR().recognizeAll(img);
    System.out.println("Page " + i + ": " + text);
}
reader.close(); // finally, close the file.

Make use of PdfUtilities.convertPdf2Png and then use the result as you did before with images.

Tess4J has a dependency on PDFBox, so you can use this library. It could be something like this:
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.logging.Level;
import java.util.logging.Logger;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
PDDocument document = PDDocument.load(new File("YOUR_PDF_FILE_PATH"));
PDFRenderer pdfRenderer = new PDFRenderer(document);
ITesseract tesseract = new Tesseract();
tesseract.setDatapath("tessdata");
tesseract.setLanguage("spa");
// Render each page to an image and run OCR on it
for (int page = 0; page < document.getNumberOfPages(); page++) {
    BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
    try {
        String str = tesseract.doOCR(bufferedImage);
        System.out.println(str);
    } catch (TesseractException ex) {
        Logger.getLogger(OCR.class.getName()).log(Level.SEVERE, null, ex);
    }
}
document.close();
I'm using Tess4J 4.5 and PDFBox 2.0 here.
You can also check
https://colwil.com/how-to-extract-text-from-a-scanned-pdf-using-ocr-in-java/.
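
Rendering at 300 DPI, as in renderImageWithDPI(page, 300, ImageType.RGB) above, produces fairly large images, because PDF page sizes are given in points (1/72 inch) and are scaled by DPI/72. Here is a rough sketch of that arithmetic in plain Java. The class name, the rounding (floor) and the 4 bytes per pixel are assumptions for illustration; PDFBox's own rounding may differ by a pixel.

```java
// Hypothetical helper, not part of PDFBox or Tess4J.
final class RenderSize {

    static final double POINTS_PER_INCH = 72.0;

    // Pixels along one side of a page that is `points` long, rendered at `dpi`.
    static int pixels(double points, double dpi) {
        return (int) Math.max(Math.floor(points * dpi / POINTS_PER_INCH), 1);
    }

    // Approximate size of an uncompressed image, assuming 4 bytes per pixel.
    static long approxBytes(int widthPx, int heightPx) {
        return 4L * widthPx * heightPx;
    }

    public static void main(String[] args) {
        // A4 is roughly 595 x 842 points.
        int w = pixels(595, 300);
        int h = pixels(842, 300);
        // prints "2479 x 3508 px, ~33 MiB"
        System.out.println(w + " x " + h + " px, ~" + approxBytes(w, h) / (1024 * 1024) + " MiB");
    }
}
```

If memory becomes a problem on long documents, a lower DPI is the obvious knob to turn, at the cost of OCR accuracy on small text.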

Reading a particular page from a PDF document using PDFBox

How do I read a particular page (given a page number) from a PDF document using PDFBox?

This should work (PDFBox 1.8):
PDPage firstPage = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
as seen in the BookMark section of the tutorial
Update 2015, Version 2.0.0 SNAPSHOT
It seems this was removed and put back (?). getPage is in the 2.0.0 javadoc. To use it:
PDDocument document = PDDocument.load(new File(filename));
PDPage page = document.getPage(0);
The getAllPages method has been renamed to getPages:
PDPage page = document.getPages().get(0);

// Using the PDFBox library available from http://pdfbox.apache.org/
// Writes specific pages of a PDF document to a new PDF file
// Reads in the PDF document
PDDocument pdDoc = PDDocument.load(file);
// Creates a new PDF document
PDDocument document = null;
// Adds page "i", where "i" is the zero-based page index, then saves the new PDF document
try {
    document = new PDDocument();
    document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(i));
    document.save("file path" + "new document title" + ".pdf");
    document.close();
    pdDoc.close();
} catch (Exception e) {
    // Report the failure instead of silently swallowing it
    e.printStackTrace();
}

Thought I would add my answer here as I found the above answers useful but not exactly what I needed.
In my scenario I wanted to scan each page individually, look for a keyword, if that keyword appeared, then do something with that page (ie copy or ignore it).
I've tried to simplify and replace common variables etc. in my answer:
public void extractImages() throws Exception {
    try {
        String destinationDir = "OUTPUT DIR GOES HERE";
        // Load the pdf
        String inputPdf = "INPUT PDF DIR GOES HERE";
        PDDocument document = PDDocument.load(inputPdf);
        List<PDPage> list = document.getDocumentCatalog().getAllPages();
        // Declare output fileName
        String fileName = "output.pdf";
        // Create output file
        PDDocument newDocument = new PDDocument();
        // Create PDFTextStripper - used for searching the page string
        PDFTextStripper textStripper = new PDFTextStripper();
        // Loop through each page and search for "SEARCH STRING". If it doesn't exist,
        // i.e. this is the image page, copy the page into the new output.pdf.
        // PDFTextStripper page numbers are 1-based, while the page list is 0-based.
        for (int pageNr = 1; pageNr <= list.size(); pageNr++) {
            // Set textStripper to search one page at a time
            textStripper.setStartPage(pageNr);
            textStripper.setEndPage(pageNr);
            // Fetch the page text and search it
            String pageText = textStripper.getText(document);
            boolean found = pageText.contains("SEARCH STRING");
            // If nothing is found, copy the page across to the new output pdf file
            if (!found) {
                PDPage returnPage = list.get(pageNr - 1);
                System.out.println("page returned is: " + returnPage);
                System.out.println("Copy page");
                newDocument.importPage(returnPage);
            }
        }
        newDocument.save(destinationDir + fileName);
        System.out.println(fileName + " saved");
        newDocument.close();
        document.close();
    }
    catch (Exception e) {
        e.printStackTrace();
        System.out.println("catch extract image");
    }
}

You can use the getPage method on a PDDocument instance:
PDDocument pdDocument = PDDocument.load(inputStream);
PDPage pdPage = pdDocument.getPage(0);

Here is the solution. Hope it will solve your issue.
String fileName = "C:\\mypdf.pdf";
PDDocument doc = PDDocument.load(fileName);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
// Pages 1 to 2 will be parsed. To parse only one page, set both values the same (e.g. setStartPage(1); setEndPage(1);)
String result = stripper.getText(doc);
doc.close();

Add this to the command-line call:
ExtractText -startPage 1 -endPage 1 filename.pdf
Change 1 to the page number that you need.
