For my work, I need to translate pdf document into Image with PDFBox.
PDDocument document = PDDocument.load(new File(fileUrl));
PDFRenderer pdfRenderer=new PDFRenderer(document);
BufferedImage bim=pdfRenderer.renderImageWithDPI(page, dpi.floatValue());
My document have many optional content group that are not visible (with Acrobat Reader for example) but after rendering my image contains this ocg.
How render pdf document without render all ocg ?
Related
I need to OCR a readable pdf again to generate hOCR file.
I am using,
Apache PDFBox -> parse the readable pdf
Qoppa -> OCR the pdf
Currently, I am trying iText to create a pdf after converting PDF page to image using PDFRenderer from PDfbox.
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("iTextImageExample.pdf"));
document.open();
Image img = Image.getInstance("C:\\Projects\\Data Mapping\\singularity-data-jockey-pdflib\\src\\test\\resources\\data\\pageimg.jpg");
document.add(img);
I also tried with Qoppa PDfDocument class.
The issue is, the output pdf generated from the image is low quality and OCR is not accurate. Is there any other way to convert the pdf to non searchable pdf?
I have been using PDFBOX and EasyTable which extends PDFBOX to draw datatables. I have hit a problem whereby I have a java object with a string of HTML data that I need to be added to the PDF using PDFBOX. A dig at the documentation seems not to bear any fruits.
The code below is a snippet hello world, which I want on the pdf been generated to have H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
}
Use jerico to format the html to free text while mapping correctly the output of tags.
sample
public String extractAllText(String htmlText){
return new net.htmlparser.jericho
.Source(htmlText)
.getRenderer()
.setMaxLineLength(Integer.MAX_VALUE)
.setNewLine(null)
.toString();
}
Include on your gradle or maven:
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'
PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive special text drawing characteristics from the tags text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in a h1 tag to select an appropriate prime header font and font size to draw that "HelloWorld".
Alternatively you can look for a library doing that HTML parsing and transforming to PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
I have a pdf with AcroForm and need to fill it with string that contains different languages glyphs (English, Chinese, Korean, Khmer).
In iText5 I've used:
AcroFields form = stamper.getAcroFields();
form.addSubstitutionFont(arialFont);
form.addSubstitutionFont(khmerFont);
And it worked fine for Chinese and Korean, but I've faced an issue with Khmer ligatures not being rendered. Found out that I need pdfCalligraph addon to make ligatures work, but it comes with iText7 only. I've managed to add paragraphs with proper Khmer ligatures rendering (requiring typography as a dependency and loading a license key). But in AcroForm it won't do it automatically. I'm struggling to find an iText7 version of addSubstitutionFont and make it work with pdfCalligraph.
Code I've used with iText7:
PdfReader reader = new PdfReader(templatePath);
PdfDocument pdf = new PdfDocument(reader, new PdfWriter(outputPath));
Document document = new Document(pdf);
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfFont khmerFont = PdfFontFactory.createFont(pathToKhmerFont, PdfEncodings.IDENTITY_H, true);
PdfFont font = PdfFontFactory.createFont(pathToArialUnicodeFont, PdfEncodings.IDENTITY_H, true);
pdf.addFont(khmerFont);
pdf.addFont(font);
FontSet set = new FontSet();
set.addFont(pathToKhmerFont);
set.addFont(pathToArialUnicodeFont);
document.setFontProvider(new FontProvider(set));
document.setProperty(Property.FONT, "Arial");
form.setNeedAppearances(true);
String content = "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일";
PdfFormField tf = form.getField("Text3");
tf.setValue(content);
// tf.setFont(khmerFont);
tf.regenerateField();
// add a paragraph just to check pdfCalligraph works
document.add(new Paragraph(content));
pdf.close();
String used to test proper rendering: "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일"
iText5 in form field without pdfCalligraph, but with substitution fonts:
iText7 in form field with pdfCalligraph loaded (set arial font field.setFont(arialFont)):
iText7 in form field with pdfCalligraph loaded (set khmer font field.setFont(khmerFont)):
iText7 same document but in a paragraph instead of form field with pdfCalligraph loaded (It is an expected resut, so it does use pdfCalligraph for paragraphs, but not for form fields):
So, as you can see there're basically 2 issues:
How do I addSubstitutionFont in iText7?
How do I use pdfCalligraph in PdfFormField appearance?
I've also checked if pdfCalligraph works in text form and it looks like it does not. Here is a code I've used to check it:
LicenseKey.loadLicenseFile(path_to_license);
String outputPath = path_to_output_doc;
PdfDocument pdf = new PdfDocument(new PdfWriter(outputPath));
Document document = new Document(pdf);
// prepare fonts for pdfCalligraph to use
FontSet set = new FontSet();
set.addFont("/path_to/Khmer.ttf");
set.addFont("/path_to/ArialUnicodeMS.ttf");
FontProvider fontProvider = new FontProvider(set);
document.setFontProvider(fontProvider);
document.setProperty(Property.FONT, "Arial");
String content = "khmer ថ្ងៃឈប់សម្រាក and chinese 假日 and korean 휴일";
// Add a paragraph to check if pdfCalligraph works
document.add(new Paragraph(content));
// Add a form text field
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
PdfTextFormField field = PdfFormField.createText(pdf, new Rectangle(36, 700, 400, 30), "test");
field.setValue(content);
form.addField(field);
document.close();
Output with pdfCalligraph dependency loaded (as you can see paragraph rendered properly, but in form all non-halvetica chars just ignored:
Output without pdfCalligraph dependency loaded (as you can see paragraph is not rendered properly which is expected, the form field looks same as with loaded pdfCalligraph):
Am I missing something?
I am trying to extract data from a specific rectangular region specified by two coordinates given inside a PDF. Is it possible to do this in a PDF or would I have to convert it into a image and use OCR? If so, does PDFBox or iText include a way to analyze images via OCR? Thanks!
If the area is text. use pdfbox,
PDDocument document = PDDocument.load(new File("target.pdf"));
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
Rectangle rect = new Rectangle(35, 375, 340, 204);
stripper.addRegion("class1", rect);
stripper.extractRegions(document.getPage(1));
System.out.println(stripper.getTextForRegion("class1")
First of all, sorry for my bad english.
I´m trying to remove the Header and Footer of a PDF page, it´s necessary to search at the page break for some words, but it´s impossible with the header and footer, so it´s necessary to crop it and then convert to text than it´s "possible" to search the words.
I´m doing it:
PDDocument pdDoc = PDDocument.load("document.pdf");
PDPage page = (PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0);
PDRectangle rectangle = new PDRectangle();
rectangle.setUpperRightY(page.findCropBox().getUpperRightY() - 100);
rectangle.setLowerLeftY(page.findCropBox().getLowerLeftY());
rectangle.setUpperRightX(page.findCropBox().getUpperRightY());
rectangle.setLowerLeftX(page.findCropBox().getLowerLeftX());
page.setMediaBox(rectangle);
PDDocument document = new PDDocument();
document.addPage(page);
document.save("newDocument.pdf");
document.close();
But when I change it to HTML, it steal keeps the text that was hidden. Is there any way to save it withou the croped area and convert to html correctly?
Thanks.
Best regard´s.