I need to OCR a readable PDF again in order to generate an hOCR file.
I am using:
Apache PDFBox -> to parse the readable PDF
Qoppa -> to OCR the PDF
Currently, I am trying to use iText to create a PDF after converting each PDF page to an image with PDFRenderer from PDFBox.
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("iTextImageExample.pdf"));
document.open();
Image img = Image.getInstance("C:\\Projects\\Data Mapping\\singularity-data-jockey-pdflib\\src\\test\\resources\\data\\pageimg.jpg");
document.add(img);
I also tried the Qoppa PDFDocument class.
The issue is that the PDF generated from the image is low quality, so the OCR is not accurate. Is there any other way to convert the PDF into a non-searchable PDF?
I have been using PDFBox and EasyTable, which extends PDFBox, to draw data tables. I have hit a problem: I have a Java object holding a string of HTML that I need to add to the PDF using PDFBox. Digging through the documentation has not turned up anything.
The code below is a hello-world snippet; I want the text in the generated PDF to have H1 formatting.
// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );
// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;
// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);
// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 );
contentStream.drawString( "<h1>HelloWorld</h1>" );
contentStream.endText();
// Make sure that the content stream is closed:
contentStream.close();
// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
}
Use Jericho to convert the HTML to plain text while rendering the output of the tags correctly.
Sample:
public String extractAllText(String htmlText){
return new net.htmlparser.jericho
.Source(htmlText)
.getRenderer()
.setMaxLineLength(Integer.MAX_VALUE)
.setNewLine(null)
.toString();
}
Add the dependency to your Gradle build (or the equivalent to your Maven POM):
compile group: 'net.htmlparser.jericho', name: 'jericho-html', version: '3.4'
PDFBox does not know HTML, at least not for creating content.
Thus, with plain PDFBox you have to parse the HTML yourself and derive the text drawing characteristics from the tags the text is in.
E.g. when you encounter "<h1>HelloWorld</h1>", you have to extract the text "HelloWorld" and use the information that it is in an h1 tag to select an appropriate top-level heading font and font size to draw that "HelloWorld".
Alternatively, you can look for a library that does the HTML parsing and transforms it into PDF text drawing instructions for PDFBox, e.g. Open HTML to PDF.
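The "parse it yourself" route can be sketched as follows. This is only an illustration under stated assumptions: the class HtmlRunSketch, its Run type, and the font sizes are hypothetical, and the regular expression only handles flat, attribute-free h1, h2 and p elements. Real HTML needs a proper parser. The sketch stops at producing (text, font size) pairs; with PDFBox 2.x you would then pass each pair to PDPageContentStream.setFont and showText (the replacements for the older drawString).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: turns simple heading/paragraph markup
// into (text, font size) runs that a PDFBox content stream could then draw.
class HtmlRunSketch {

    // One piece of text plus the font size chosen from its enclosing tag.
    static final class Run {
        final String text;
        final float fontSize;

        Run(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Matches non-nested <h1>, <h2> or <p> elements without attributes.
    private static final Pattern ELEMENT =
            Pattern.compile("<(h1|h2|p)>(.*?)</\\1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Illustrative sizes only (an assumption); pick values that suit your layout.
    static float fontSizeFor(String tag) {
        switch (tag.toLowerCase()) {
            case "h1": return 24f;
            case "h2": return 18f;
            default:   return 12f;
        }
    }

    // Extracts the text of each matched element and pairs it with a font size.
    static List<Run> parse(String html) {
        List<Run> runs = new ArrayList<>();
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            runs.add(new Run(m.group(2).trim(), fontSizeFor(m.group(1))));
        }
        return runs;
    }

    public static void main(String[] args) {
        for (Run run : parse("<h1>HelloWorld</h1><p>Body text</p>")) {
            // With PDFBox, this is where the font size would be set and run.text drawn.
            System.out.println(run.fontSize + " pt: " + run.text);
        }
    }
}
```

Keeping the parsing separate from the drawing means the tag-to-style mapping can be tested without creating a PDF at all.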
I'm developing a Java application that has to process a folder of PDF/A files, adding a page with some information to each of them using the Apache PDFBox library. The problem is that the output PDF file is no longer PDF/A after the information is added. This is the validation result from https://www.pdf-online.com/osa/validate.aspx:
And this is the relevant part of the code that I use to generate the PDF file:
String pdfFileName = this.baseFolder+this.extendedPDFFileName;
File file = new File(pdfFileName);
PDDocument pdfFile = PDDocument.load(file);
PDPage pag = new PDPage();
// As a test, simply adding a page makes the PDF invalid as PDF/A
pdfFile.addPage(pag);
pdfFile.save(file);
pdfFile.close();
What can I do to keep the file valid as PDF/A? Thanks in advance.
As Tilman Hausherr suggested, the problem has been solved by adding a PDResources object to the new page, like this:
pag.setResources(new PDResources());
Now I'm having trouble with the embedded fonts, but that is another question :)
Many thanks!
Your code creates a normal PDF; you should create a valid PDF/A from the start.
Here's a link: https://pdfbox.apache.org/1.8/cookbook/pdfacreation.html
For my work, I need to convert a PDF document into an image with PDFBox.
PDDocument document = PDDocument.load(new File(fileUrl));
PDFRenderer pdfRenderer=new PDFRenderer(document);
BufferedImage bim=pdfRenderer.renderImageWithDPI(page, dpi.floatValue());
My document has many optional content groups (OCGs) that are not visible (in Acrobat Reader, for example), but the rendered image contains them.
How can I render the PDF document without rendering these hidden OCGs?
I want to convert dynamic HTML to PDF. The following code shows the conversion of static HTML to PDF:
Document document = new Document();
// step 2
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("d:/sample/pdfaskkea.pdf"));
// step 3
document.open();
// step 4
XMLWorkerHelper.getInstance().parseXHtml(writer, document,new FileInputStream("webcontent/jsp/index.jsp"), null);
// XMLWorkerHelper.getInstance().parseXHtml(writer, document,new FileInputStream("C:\\pdf_table1.html"), null);
//step 5
document.close();
System.out.println( "PDF Created!" );
From your question, it is not clear what you mean by "dynamic HTML".
If you mean HTML created dynamically, with JSP for example, PD4ML offers a JSP custom tag library. You only need to surround your code with the library's opening and closing tags to output PDF instead of HTML.
If by dynamic HTML you mean JavaScript-rich HTML pages, I would recommend taking a look at PhantomJS, which can also convert HTML built on the fly with JavaScript. PhantomJS is a native standalone application based on WebKit.
You can use the iText PDF library to convert HTML into rich PDF files. To generate dynamic HTML content, you can use a template library such as Thymeleaf.
I have a detailed article about generating PDF files with Thymeleaf in a Spring Boot application, if you are interested.
I am using the following code to generate a PDF file from an HTML report:
String url = new File("Test.html").toURI().toURL().toString();
OutputStream os = new FileOutputStream("Test.pdf");
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(url);
renderer.layout();
renderer.createPDF(os);
os.close();
I was able to use it to convert sample HTML files to PDF. But in my real use case, the HTML content contains various special symbols, such as &, < and >, that can't be parsed as XML.
I tried using CDATA when generating the HTML itself, but later found that the text wrapped in CDATA is not visible in the HTML.
Does anyone have a solution for this?
Have you tried printing to PDF from the browser? Search for PrimoPDF, a program that will let you do it.
I don't know if this will help you, but you can use StringEscapeUtils from Apache Commons. It has methods for escaping and unescaping HTML, which you can use to pre-process your HTML before PDF generation.
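As a rough illustration of that pre-processing step, the sketch below escapes the problematic characters in a text value before it is placed inside the report markup. It is a dependency-free stand-in, not the Commons implementation: the class XmlTextEscaper and its method escapeText are hypothetical names. With Apache Commons Lang 3 on the classpath, StringEscapeUtils.escapeHtml4 or escapeXml10 would fill the same role. Note that only the text values should be escaped, not the surrounding tags.

```java
// Hypothetical helper: a minimal, dependency-free stand-in for the kind of
// escaping StringEscapeUtils provides, covering only &, <, > and ".
class XmlTextEscaper {

    // Replaces characters that would break XML parsing with their entities.
    static String escapeText(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String value = "a < b && b > c";
        // Escape the value only; the <p> tags are markup and stay untouched.
        System.out.println("<p>" + escapeText(value) + "</p>");
        // prints <p>a &lt; b &amp;&amp; b &gt; c</p>
    }
}
```

Escaping at the point where values are written into the HTML avoids the need for CDATA sections, and the result still displays the original characters in a browser.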