Read data from image in PDF - java

I am using iText java TextExtraction to read text from PDF file. I use below code and it works fine for PDF in English Now I have PDF containing data as image. I want to read data from that image
public class pdfreader {
public static void main(String[] args) throws IOException, DocumentException, TransformerException {
String SRC = "";
String DEST = "";
for (String s : args) {
SRC = args[0];
DEST = args[1];
}
File file = new File(DEST);
file.getParentFile().mkdirs();
new pdfreader().readText(SRC, DEST);
}
public void readText(String src, String dest) throws IOException, DocumentException, TransformerException {
try {
PdfReader pdfReader = new PdfReader(src);
PdfReaderContentParser PdfParser = new PdfReaderContentParser(
pdfReader);
PrintWriter out = new PrintWriter(new FileOutputStream(
dest));
TextExtractionStrategy textStrategy;
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
textStrategy = PdfParser.processContent(i,
new SimpleTextExtractionStrategy());
out.println(textStrategy.getResultantText());
}
out.flush();
out.close();
pdfReader.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}

You can implement an OCR workflow with iText. As Amedee already hinted, this is something we have tried at iText, with very promising results.
The algorithm (high level):
Implement IEventListener to parse pages of your document
Look out for ImageRenderInfo events, they are fired when the PDF parser hits an image
You can call getImage() on the event and ultimately get a BufferedImage
Feed the BufferedImage to Tesseract
Apply the coordinate transform (tesseract does not use the same coordinate space as iText)
Now that you have the texf in the image, and the location, you can use iText to overlay text on your PDF. Or simply extract it.

iText doesn't support OCR to extract text from images. Try to use Tesseract or something else.

Related

Convert PDF to JPG2000 file(s)

I recently started working on this project where I need to convert a PDF File into a JPEG2000 file(s) - 1 jp2 file per page -.
The goal was to replace a previous pdf to jpeg converter method we had, in order to reduce the size of the output file(s).
Based on a code I found on the internet, I made the pdftojpeg2000 converter method below, and I've been changing the setEncodingRate parameter value and comparing the results.
I managed to get smaller jpeg2000 output files, but the quality is very poor, compared to the Jpeg ones, specially for colored text or images.
Here is what my orginal pdf file looks like:
When I set setEncodingRate to 0.8 it looks like this:
My output file size is 850Ko, which is even bigger than the Jpeg (around 600Ko) ones, and lower quality.
At 0.1 setEncodingRate, the file size is considerably small, 111 Ko, but basically unreadable.
So basically what I'm trying to get here is smaller output files ( <600K ) with a better quality, And I'm wondering if it is feasible with the Jpeg2000 format.
public class ImageConverter {
public void compressor(String inputFile, String outputFile) throws IOException {
J2KImageWriteParam iwp = new J2KImageWriteParam();
PDDocument document = PDDocument.load(new File (inputFile), MemoryUsageSetting.setupMixed(10485760L));
PDFRenderer pdfRenderer = new PDFRenderer(document);
int nbPages = document.getNumberOfPages();
int pageCounter = 0;
BufferedImage image;
for (PDPage page : document.getPages()) {
if (page.hasContents()) {
image = pdfRenderer.renderImageWithDPI(pageCounter, 300, ImageType.RGB);
if (image == null)
{
System.out.println("If no registered ImageReader claims to be able to read the resulting stream");
}
Iterator writers = ImageIO.getImageWritersByFormatName("JPEG2000");
String name = null;
ImageWriter writer = null;
while (name != "com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageWriter") {
writer = (ImageWriter) writers.next();
name = writer.getClass().getName();
System.out.println(name);
}
File f = new File(outputFile+"_"+pageCounter+".jp2");
long s = System.currentTimeMillis();
ImageOutputStream ios = ImageIO.createImageOutputStream(f);
writer.setOutput(ios);
J2KImageWriteParam param = (J2KImageWriteParam) writer.getDefaultWriteParam();
IIOImage ioimage = new IIOImage(image, null, null);
param.setSOP(true);
param.setWriteCodeStreamOnly(true);
param.setProgressionType("layer");
param.setLossless(true);
param.setCompressionMode(J2KImageWriteParam.MODE_EXPLICIT);
param.setCompressionType("JPEG2000");
param.setCompressionQuality(0.01f);
param.setEncodingRate(1.01);
param.setFilter(J2KImageWriteParam.FILTER_53 );
writer.write(null, ioimage, param);
System.out.println(System.currentTimeMillis() - s);
writer.dispose();
ios.flush();
ios.close();
image.flush();
pageCounter++;
}
}
}
public static void main(String[] args) {
String input = "E:/IMGTEST/mail-DOC0002.pdf";
String output = "E:/IMGTEST/mail-DOC0002/docamail-DOC0002-";
ImageConverter imgcv = new ImageConverter();
try {
imgcv.compressor(input, output);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

How can I merge the documents consisting PDFs as well as images?

In my current code I am merging the documents consisting PDF files.
public static void appApplicantDownload(File file) {
Connection con = getConnection();
Scanner sc = new Scanner(System.in);
List < InputStream > list = new ArrayList < InputStream > ();
try {
OutputStream fOriginal = new FileOutputStream(file, true); // original
list.add(new FileInputStream(file1));
list.add(new FileInputStream(file2));
list.add(new FileInputStream(file3));
doMerge(list, fOriginal);
} catch (Exception e) {
}
}
public static void doMerge(List < InputStream > list, OutputStream outputStream) throws DocumentException, IOException {
try {
System.out.println("Merging...");
Document document = new Document();
PdfCopy copy = new PdfCopy(document, outputStream);
document.open();
for (InputStream in : list) {
ByteArrayOutputStream b = new ByteArrayOutputStream();
IOUtils.copy( in , b);
PdfReader reader = new PdfReader(b.toByteArray());
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
copy.addPage(copy.getImportedPage(reader, i));
}
}
outputStream.flush();
document.close();
outputStream.close();
} catch (Exception e) {
e.printStackTrace();
}
}
But now I want to change the code such that it should allow Image as well as PDFs to be merged. Above code is giving me the error that No PDF Signature found
First, you have to know somehow if the file is a PDF or an image. The easiest way would be to use the extension of the file. So you would have get extension of the file and then pass this information to your doMerge method. Do achieve that, I would change your current method
public static void doMerge(List<InputStream> list, OutputStream outputStream)
For something like
public static void doMerge(Map<InputStream, String> files, OutputStream outputStream)
So each InputStream is associated with a extension.
Second, you have to load images and pdf seperately. So your loop should look like
for (InputStream in : files.keySet()) {
String ext = files.get(in);
if(ext.equalsIgnoreCase("pdf"))
{
//load pdf here
}
else if(ext.equalsIgnoreCase("png") || ext.equalsIgnoreCase("jpg"))
{
//load image here
}
}
In java, you can easily load an Image using ImageIO. Look at this question for more details on this topic: Load image from a filepath via BufferedImage
Then to add your image into the PDF, use the PdfWriter
PdfWriter pw = PdfWriter.GetInstance(doc, outputStream);
Image img = Image.GetInstance(inputStream);
doc.Add(img);
doc.NewPage();
If you want to convert your image into PDF before and merge after you also can do that but you just have to use the PdfWriter to write them all first.
Create a new page in the opened PDF document and use the function for inserting images to place the image on that page.

How do you extract color profiles from a PDF file using pdfbox (or other open source Java lib)

Once you've loaded a document:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
How do you get the page by page printing color intent from the PDDocument? I read the docs, didn't see coverage.
This gets the output intents (you'll get these with high quality PDF files) and also the icc profiles for colorspaces and images:
PDDocument doc = PDDocument.load(new File("XXXXX.pdf"));
for (PDOutputIntent oi : doc.getDocumentCatalog().getOutputIntents())
{
COSStream destOutputIntent = oi.getDestOutputIntent();
String info = oi.getOutputCondition();
if (info == null || info.isEmpty())
{
info = oi.getInfo();
}
InputStream is = destOutputIntent.createInputStream();
FileOutputStream fos = new FileOutputStream(info + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
for (COSName name : page.getResources().getColorSpaceNames())
{
PDColorSpace cs = page.getResources().getColorSpace(name);
if (cs instanceof PDICCBased)
{
PDICCBased iccCS = (PDICCBased) cs;
InputStream is = iccCS.getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
for (COSName name : page.getResources().getXObjectNames())
{
PDXObject x = page.getResources().getXObject(name);
if (x instanceof PDImageXObject)
{
PDImageXObject img = (PDImageXObject) x;
if (img.getColorSpace() instanceof PDICCBased)
{
InputStream is = ((PDICCBased) img.getColorSpace()).getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
}
}
doc.close();
What this doesn't do (but I could add some of it if needed):
colorspaces of shadings, patterns, xobject forms, appearance stream resources
recursion in colorspaces like DeviceN and Separation
recursion in patterns, xobject forms, soft masks
I read the examples on "How to create/add Intents to a PDF file". I couldn't get an example on "How to get intents". Using the API/examples, I wrote the following (untested code) to get the COSStream object for each of the Intents. See if this is useful for you.
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
for (PDOutputIntent e : list) {
p("PDOutputIntent Found:");
p("Info="+e.getInfo());
p("OutputCondition="+e.getOutputCondition());
p("OutputConditionIdentifier="+e.getOutputConditionIdentifier());
p("RegistryName="+e.getRegistryName());
COSStream cstr = e.getDestOutputIntent();
}
static void p(String s) {
System.out.println(s);
}
}
Using itext pdf library (fork of an older version 4.2.1) you could do smth. like:
PdfReader reader = new com.lowagie.text.pdf.PdfReader(Path pathToPdf);
PRStream stream = (PRStream) reader.getCatalog().getAsDict(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
For extracting the profile from each page you could iterate over each page like
for(int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
PRStream prStream = (PRStream) pdfReader.getPageN(i).getDirectObject(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
}
I don't know whether this code help or not, after searching below links,
How do I add an ICC to an existing PDF document
PdfBox - PDColorSpaceFactory.createColorSpace(document, iccColorSpace) throws nullpointerexception
https://pdfbox.apache.org/docs/1.8.11/javadocs/org/apache/pdfbox/pdmodel/graphics/color/PDICCBased.html
I found some code, check whether it help or not,
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
PDDocumentCatalog cat = doc.getDocumentCatalog();
COSArray cosArray = doc.getCOSObject();
PDICCBased pdCS = new PDICCBased( cosArray );
pdCS.getNumberOfComponents()
static void p(String s) {
System.out.println(s);
}
}

Convert image to pdf with iText and Java

I succesfully converted image files (gif, png, jpg, bmp) to pdf's using iText 1.3.
I can't change the version since we can't just change versions of a jar obviously in a professional environment.
The problem that I have is that the size of the image in the pdf is larger than the image itself. I am not talking about the file size but about the size of the image when the zoom is set to 100% on both the original image file and the pdf.
The pdf shows the image about a 20% to 30% bigger than the original image.
What am I doing wrong?
public void convertOtherImages2pdf(byte[] in, OutputStream out, String title, String author) throws IOException {
Image image = Image.getInstance(in);
Rectangle imageSize = new Rectangle(image.width() + 1f, image.height() + 1f);
image.scaleAbsolute(image.width(), image.height());
com.lowagie.text.Document document = new com.lowagie.text.Document(imageSize, 0, 0, 0, 0);
PdfWriter writer = PdfWriter.getInstance(document, out);
document.open();
document.add(image);
document.close();
writer.close();
}
Make MultipleImagesToPdf Class
public void imagesToPdf(String destination, String pdfName, String imagFileSource) throws IOException, DocumentException {
Document document = new Document(PageSize.A4, 20.0f, 20.0f, 20.0f, 150.0f);
String desPath = destination;
File destinationDirectory = new File(desPath);
if (!destinationDirectory.exists()){
destinationDirectory.mkdir();
System.out.println("DESTINATION FOLDER CREATED -> " + destinationDirectory.getAbsolutePath());
}else if(destinationDirectory.exists()){
System.out.println("DESTINATION FOLDER ALREADY CREATED!!!");
}else{
System.out.println("DESTINATION FOLDER NOT CREATED!!!");
}
File file = new File(destinationDirectory, pdfName + ".pdf");
FileOutputStream fileOutputStream = new FileOutputStream(file);
PdfWriter pdfWriter = PdfWriter.getInstance(document, fileOutputStream);
document.open();
System.out.println("CONVERTER START.....");
String[] splitImagFiles = imagFileSource.split(",");
for (String singleImage : splitImagFiles) {
Image image = Image.getInstance(singleImage);
document.setPageSize(image);
document.newPage();
image.setAbsolutePosition(0, 0);
document.add(image);
}
document.close();
System.out.println("CONVERTER STOPTED.....");
}
public static void main(String[] args) {
try {
MultipleImagesToPdf converter = new MultipleImagesToPdf();
Scanner sc = new Scanner(System.in);
System.out.print("Enter your destination folder where save PDF \n");
// Destination = D:/Destination/;
String destination = sc.nextLine();
System.out.print("Enter your PDF File Name \n");
// Name = test;
String name = sc.nextLine();
System.out.print("Enter your selected image files name with source folder \n");
String sourcePath = sc.nextLine();
// Source = D:/Source/a.jpg,D:/Source/b.jpg;
if (sourcePath != null || sourcePath != "") {
converter.imagesToPdf(destination, name, sourcePath);
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
=================================
Here i use itextpdf-5.4.0 Library
Happy Coding :)
You need to scale the image by the dpi.
float scale = 72 / dpi;
I don't know if such an ancient iText has that image information, more recent iText versions have it.

How to get image of a PDF page(included text). not images in a PDF page

I have tried a PDF page to image, But just extracted each images in the PDF page. not page image.
Below Code :
public class ExtractionPDFtoThumbImgs {
static String filePath = "/Users/tmdtjq/Downloads/PDFTest/test.pdf";
static String outputFilePath = "/Users/tmdtjq/Downloads/PDFTest/pageimages";
public static void change(File inputFile, File outputFolder) throws IOException {
//TODO check the input file exists and is PDF
//TODO for the treatment of PDF encrypted
PDDocument doc = null;
try {
doc = PDDocument.load(inputFile);
List<PDPage> allPages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i <allPages.size(); i++) {
PDPage page = allPages.get(i);
page.convertToImage();
BufferedImage image = page.convertToImage();
ImageIO.write(image, "jpg", new File(outputFolder.getAbsolutePath() + File.separator + (i + 1) + ".jpg"));
}
} finally {
if (doc != null) {
doc.close();
}
}
}
public static void main(String[] args) {
File inputFile = new File(ExtractionPDFtoThumbImgs.filePath);
File outputFolder = new File(ExtractionPDFtoThumbImgs.outputFilePath);
if(!outputFolder.exists()){
outputFolder.mkdirs();
}
try {
ExtractionPDFtoThumbImgs.change(inputFile, outputFolder);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Above code extract images in PDF page. not convert image in PDF page(included text).
Are there converting tools (PDF page to image) or Converting PDFBox class?
Please Suggest how to get image of a PDF page(included text). not to get images in a PDF page.
Try pdftocairo, it's part of poppler.
I was using imagemagick to convert PDF to images, but it relies on Ghostscript which is sometimes picky about the PDF you feed it so it was hit or miss...
So far pdftocairo has been solid.
http://poppler.freedesktop.org

Categories