I'm trying to convert a multi page tiff to a pdf using PDFBox and not been successful. I'm not able to use apache imaging-commons in the company as its not a stable release.
Problem: Not able to read a multi tiff and write to a pdf.
Working solution so far: Only the first page is getting written and saved to pdf. Also when a tiff is a single page, it works.
Below is the code:
PDDocument doc = new PDDocument();
log.info("Read Image");
log.info("Process Image parts");
//Get the number of pages
int pages = 0;
try(ImageInputStream imageInputStream = ImageIO.createImageInputStream(new File("src/main/resources/output/testpdf.tiff"))) {
if (imageInputStream != null && imageInputStream.length() != 0) {
Iterator<ImageReader> iteratorIO = ImageIO.getImageReaders(imageInputStream);
if (iteratorIO != null && iteratorIO.hasNext()) {
ImageReader reader = iteratorIO.next();
reader.setInput(imageInputStream);
pages = reader.getNumImages(true);
log.info("Number of pages in the tiff is " + pages);
}
}
}
//Need a reader here for different page ?
for (int i=0; i<pages; i++) {
BufferedImage bimage = ImageIO.read(file);
PDPage page = new PDPage();
doc.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(doc, page);
try {
// the .08F can be tweaked. Go up for better quality,
// but the size of the PDF will increase
PDImageXObject image = JPEGFactory.createFromImage(doc, bimage, 0.08f);
Dimension scaledDim = getScaledDimension(new Dimension(image.getWidth(), image.getHeight()),
new Dimension((int) page.getMediaBox().getWidth(), (int) page.getMediaBox().getHeight()));
contentStream.drawImage(image, 1, 1, scaledDim.width, scaledDim.height);
} finally {
contentStream.close();
}
}
doc.save("src/main/resources/output/testpdf.pdf");
doc.close();
Do I need to come up with a reader which is not provided by ImageIO?
OR
Do I need to split the tiff multi page to individual pages and then write to a pdf?
I've not worked with image manipulations much, but appreciate the level of quality the ImageIO after the conversion process!
Thanks
Try this, you need PDFBox jar and sun.jai.codec jar
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import com.sun.media.jai.codec.FileSeekableStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageDecoder;
import com.sun.media.jai.codec.SeekableStream;
import com.sun.media.jai.codec.TIFFDecodeParam;
public class FinalTtoP {
public static void main(String args[]) throws IOException
{
PDDocument document=new PDDocument();
File file = new File("C:/nn.tif"); //Enter Tiff file path
ImageInputStream isb = ImageIO.createImageInputStream(file);
Iterator<ImageReader> iterator = ImageIO.getImageReaders(isb);
if (iterator == null || !iterator.hasNext())
{
throw new IOException("Image file format not supported by ImageIO: ");
}
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(isb);
int nbPages = reader.getNumImages(true);
System.out.println(nbPages);
for(int p=0;p<nbPages;p++)
{
BufferedImage bufferedImage = reader.read(p);
PDPage page = new PDPage();
document.addPage(page);
PDImageXObject i = LosslessFactory.createFromImage(document, bufferedImage);
PDPageContentStream content =new PDPageContentStream(document, page);
content.drawImage(i, 0,0 ,page.getMediaBox().getWidth(),page.getMediaBox().getHeight());
content.close();
}
document.save("C:/nnnnm.pdf"); //Enter path to save your file with .pdf extension
document.close();
}
}
Refer to this code it will improve speed and it still slow then need to use itext.
public static byte[] convertTiffToPdf(File tiffFile) throws IOException {
ByteArrayOutputStream outStream = null;
PDDocument document = null;
ImageInputStream imgInputStream = null;
try {
outStream = new ByteArrayOutputStream();
document = new PDDocument();
PDRectangle pageSize = PDRectangle.LETTER;
int noOfPages = 0;
imgInputStream = ImageIO.createImageInputStream(tiffFile);
Iterator<ImageReader> iterator = ImageIO.getImageReaders(imgInputStream);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: ");
}
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(imgInputStream);
noOfPages = reader.getNumImages(true);
for (int i = 0; i < noOfPages; i++) {
PDPageContentStream content = null;
try {
BufferedImage bufferedImage = reader.read(i);
PDPage page = new PDPage(pageSize);
document.addPage(page);
// PDImageXObject imgObject = LosslessFactory.createFromImage(document, bufferedImage); //Commented for PR 1028
PDImageXObject imgObject = CCITTFactory.createFromFile(document, tiffFile, i); //PR 1028
//PDImageXObject imgObject = JPEGFactory.createFromImage(document, bufferedImage);
content = new PDPageContentStream(document, page);
content.drawImage(imgObject, 0, 0, pageSize.getWidth(), pageSize.getHeight());
} catch(Exception e) {
e.printStackTrace();
} finally {
content.close();
}
}
document.save(outStream);
byte[] fileBytes = outStream.toByteArray();
return fileBytes;
} finally {
if (document != null) {
document.close();
}
if (imgInputStream != null) {
imgInputStream.close();
}
if (outStream != null) {
outStream.close();
}
}
}
You can use CCITTFactory.createFromFile(PDDocument document, File file, int number) which works for most bitonal tiff files. If that one doesn't work (because the TIFF file is tiled or in color), then read the individual pages into BufferedImage objects (see here) and then use LosslessFactory.createFromImage(PDDocument document, BufferedImage image) with the result.
Related
I need to convert scanned PDF to grayscale PDF. I found 2 solutions for that.
First one is to just use renderImage
private void convertToGray() throws IOException {
File pdfFile = new File(PATH);
try (PDDocument originalPdf = PDDocument.load(pdfFile);
PDDocument doc = new PDDocument()) {
LOGGER.info("Current heap after loading file: {}", Runtime.getRuntime().totalMemory());
PDFRenderer pdfRenderer = new PDFRenderer(originalPdf);
for (int pageNum = 0; pageNum < originalPdf.getNumberOfPages(); pageNum++) {
// PDImageXObject pdImage = LosslessFactory.createFromImage(doc, bufferedImage);
BufferedImage grayImage = pdfRenderer.renderImageWithDPI(pageNum, 300F, ImageType.GRAY);
PDImageXObject pdImage = JPEGFactory.createFromImage(doc, grayImage);
float pageWight = originalPdf.getPage(pageNum).getMediaBox().getWidth();
float pageHeight = originalPdf.getPage(pageNum).getMediaBox().getHeight();
PDPage page = new PDPage(new PDRectangle(pageWight, pageHeight));
doc.addPage(page);
try (PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
contentStream.drawImage(pdImage, 0F, 0F, pageWight, pageHeight);
}
}
doc.save(NEW_PATH);
}
}
But this leads to increase size of the file (because some PDFs has less DPI than 300.
Second one is to just replace existing image with gray analog
private void convertByImageToGray() throws IOException {
File pdfFile = new File(PATH);
try (PDDocument document = PDDocument.load(pdfFile)) {
List<COSObject> objects = document.getDocument().getObjectsByType(COSName.IMAGE);
for (COSObject object : objects) {
LOGGER.info("Class: {}; {}", object.getClass(), object.toString());
}
for (int pageNum = 0; pageNum < document.getNumberOfPages(); pageNum++) {
PDPage page = document.getPage(pageNum);
replaceImage(document, page);
}
document.save(NEW_PATH);
}
}
private void replaceImage(PDDocument document, PDPage page) throws IOException {
PDResources resources = page.getResources();
Iterable<COSName> xObjectNames = resources.getXObjectNames();
if (xObjectNames != null) {
for (COSName xObjectName : xObjectNames) {
PDXObject object = resources.getXObject(xObjectName);
if (object instanceof PDImageXObject) {
PDImageXObject img1 = (PDImageXObject) object;
BufferedImage bufferedImage1 = img1.getImage();
BufferedImage grayBufferedImage = convertBufferedImageToGray(bufferedImage1);
// PDImageXObject grayImage = JPEGFactory.createFromImage(document, grayBufferedImage);
PDImageXObject grayImage = LosslessFactory.createFromImage(document, grayBufferedImage);
resources.put(xObjectName, grayImage);
}
}
}
}
private static BufferedImage convertBufferedImageToGray(BufferedImage sourceImg) {
ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_GRAY);
ColorConvertOp op = new ColorConvertOp(sourceImg.getColorModel().getColorSpace(), cs, null);
op.filter(sourceImg, sourceImg);
return sourceImg;
}
But still some files increase in size like 3 times (even they were already grayscale; interesting that int this case JPEGFactory produces larger files than LosslessFactory). All images in grayscale PDF have the same size as original ones. And I don't understand why.
Maybe there is a better way to make grayscale PDF with predictable size (except ghostscript)?
UPDATE: I've just realized that the issue is with creating PDF from image. It does not compress as well.
For example, I have dummy 1-page scan file that is less than 1 Mb. But if I get image from it (directly copying via Acrobat Reader to Paint, or via code above) it size is ~8-10 Mb depending on the method. And if I create new PDF from this image it's barely compressed. Here is example code:
File pdfFile = new File(FULL_FILE);
try (PDDocument document = PDDocument.load(pdfFile)) {
PDPage page = new PDPage();
document.addPage(page);
PDImageXObject pdImage = PDImageXObject.createFromFile("example.png", document);
try (PDPageContentStream contents = new PDPageContentStream(document, page)) {
contents.drawImage(pdImage, 0F, 0F);
}
document.save(FULL_FILE_NEW);
}
Yes LosslessFactory produces smaller files compared to JPEGFactory
In the below link there are different methods to try and achieve the same goal. Overall the best quality gray scale image was the one from Option 6, however this was by no means the fastest (I myself used Option 4). Comparisons are also provided for you to choose
This link contains possible ways to convert color images to black. It helped me a lot.
Let me know if it works for you and approve my answer if it helped.
Once you've loaded a document:
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
How do you get the page by page printing color intent from the PDDocument? I read the docs, didn't see coverage.
This gets the output intents (you'll get these with high quality PDF files) and also the icc profiles for colorspaces and images:
PDDocument doc = PDDocument.load(new File("XXXXX.pdf"));
for (PDOutputIntent oi : doc.getDocumentCatalog().getOutputIntents())
{
COSStream destOutputIntent = oi.getDestOutputIntent();
String info = oi.getOutputCondition();
if (info == null || info.isEmpty())
{
info = oi.getInfo();
}
InputStream is = destOutputIntent.createInputStream();
FileOutputStream fos = new FileOutputStream(info + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
for (COSName name : page.getResources().getColorSpaceNames())
{
PDColorSpace cs = page.getResources().getColorSpace(name);
if (cs instanceof PDICCBased)
{
PDICCBased iccCS = (PDICCBased) cs;
InputStream is = iccCS.getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
for (COSName name : page.getResources().getXObjectNames())
{
PDXObject x = page.getResources().getXObject(name);
if (x instanceof PDImageXObject)
{
PDImageXObject img = (PDImageXObject) x;
if (img.getColorSpace() instanceof PDICCBased)
{
InputStream is = ((PDICCBased) img.getColorSpace()).getPDStream().createInputStream();
FileOutputStream fos = new FileOutputStream(System.currentTimeMillis() + ".icc");
IOUtils.copy(is, fos);
fos.close();
is.close();
}
}
}
}
doc.close();
What this doesn't do (but I could add some of it if needed):
colorspaces of shadings, patterns, xobject forms, appearance stream resources
recursion in colorspaces like DeviceN and Separation
recursion in patterns, xobject forms, soft masks
I read the examples on "How to create/add Intents to a PDF file". I couldn't get an example on "How to get intents". Using the API/examples, I wrote the following (untested code) to get the COSStream object for each of the Intents. See if this is useful for you.
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
for (PDOutputIntent e : list) {
p("PDOutputIntent Found:");
p("Info="+e.getInfo());
p("OutputCondition="+e.getOutputCondition());
p("OutputConditionIdentifier="+e.getOutputConditionIdentifier());
p("RegistryName="+e.getRegistryName());
COSStream cstr = e.getDestOutputIntent();
}
static void p(String s) {
System.out.println(s);
}
}
Using itext pdf library (fork of an older version 4.2.1) you could do smth. like:
PdfReader reader = new com.lowagie.text.pdf.PdfReader(Path pathToPdf);
PRStream stream = (PRStream) reader.getCatalog().getAsDict(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
For extracting the profile from each page you could iterate over each page like
for(int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
PRStream prStream = (PRStream) pdfReader.getPageN(i).getDirectObject(PdfName.DESTOUTPUTPROFILE);
if (stream != null)
{
byte[] destProfile = PdfReader.getStreamBytes(stream);
}
}
I don't know whether this code help or not, after searching below links,
How do I add an ICC to an existing PDF document
PdfBox - PDColorSpaceFactory.createColorSpace(document, iccColorSpace) throws nullpointerexception
https://pdfbox.apache.org/docs/1.8.11/javadocs/org/apache/pdfbox/pdmodel/graphics/color/PDICCBased.html
I found some code, check whether it help or not,
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("blah.pdf"));
PDDocumentCatalog cat = doc.getDocumentCatalog();
List<PDOutputIntent> list = cat.getOutputIntents();
PDDocumentCatalog cat = doc.getDocumentCatalog();
COSArray cosArray = doc.getCOSObject();
PDICCBased pdCS = new PDICCBased( cosArray );
pdCS.getNumberOfComponents()
static void p(String s) {
System.out.println(s);
}
}
I want to convert a PDF page to image file. Text is missing when I convert a PDF page to image using java.
The file which I want to convert 46_2.pdf after converting it shown me like 46_2.png
Code:
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
public class ConvertPDFPageToImageWithoutText {
public static void main(String[] args) {
try {
String oldPath = "C:/PDFCopy/46_2.pdf";
File oldFile = new File(oldPath);
if (oldFile.exists()) {
PDDocument document = PDDocument.load(oldPath);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
for (PDPage page : list) {
BufferedImage image = page.convertToImage();
File outputfile = new File("C:/PDFCopy/image.png");
ImageIO.write(image, "png", outputfile);
document.close();
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Since you're using PDFBox, try using PDFImageWriter.writeToImage instead of PDPage.convertToImage. This post seems relevant to what you are trying to do.
I had the same problem. I found an article(unfortunally can't remember where because I've read hundred of them). There an author complained that appeared such problems in PDFBox after they updated the Java version to 7.21. So I'm using 7.17 and it works for me:)
Use the latest version of PDFBox(I am using 2.0.9) and add JAI Image I/O dependency from here. This is sample running code on JAVA 7.
public void pdfToImageConvertorUsingPdfBox(String inputPdfPath) throws Exception {
File sourceFile = new File(inputPdfPath);
String formatName = "png";
if (sourceFile.exists()) {
PDDocument document = PDDocument.load(sourceFile);
PDFRenderer pdfRenderer = new PDFRenderer(document);
int count = document.getNumberOfPages();
for (int i = 0; i < count; i++) {
BufferedImage image = pdfRenderer.renderImageWithDPI(i, 200, ImageType.RGB);
String output = FilenameUtils.removeExtension(inputPdfPath) + "_" + (i + 1) + "." + formatName;
ImageIO.write(image, formatName, new File(output));
}
document.close();
} else {
logger.error(sourceFile.getName() + " File not exists");
}
}
Been tearing my hair on this one.
How do I split a multipage / multilayer TIFF image into several individual images?
Demo image available here.
(Would prefer a pure Java (i.e. non-native) solution. Doesn't matter if the solution relies on commercial libraries.)
You can use the Java Advanced Imaging library, JAI, to split a mutlipage TIFF, by using an ImageReader:
ImageInputStream is = ImageIO.createImageInputStream(new File(pathToImage));
if (is == null || is.length() == 0){
// handle error
}
Iterator<ImageReader> iterator = ImageIO.getImageReaders(is);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: " + pathToImage);
}
// We are just looking for the first reader compatible:
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(is);
Then you can get the number of pages:
nbPages = reader.getNumImages(true);
and read pages separatly:
reader.read(numPage)
A fast but non JAVA solution is tiffsplit. It is part of the libtiff library.
An example command to split a tiff file in all it's layers would be:
tiffsplit image.tif
The manpage says it all:
NAME
tiffsplit - split a multi-image TIFF into single-image TIFF files
SYNOPSIS
tiffsplit src.tif [ prefix ]
DESCRIPTION
tiffsplit takes a multi-directory (page) TIFF file and creates one or more single-directory (page) TIFF files
from it. The output files are given names created by concatenating a prefix, a lexically ordered suffix in the
range [aaa-zzz], the suffix .tif (e.g. xaaa.tif, xaab.tif, xzzz.tif). If a prefix is not specified on the
command line, the default prefix of x is used.
OPTIONS
None.
BUGS
Only a select set of ‘‘known tags’’ is copied when splitting.
SEE ALSO
tiffcp(1), tiffinfo(1), libtiff(3TIFF)
Libtiff library home page: http://www.remotesensing.org/libtiff/
I used this sample above with a tiff plugin i found called imageio-tiff.
Maven dependency:
<dependency>
<groupId>com.tomgibara.imageio</groupId>
<artifactId>imageio-tiff</artifactId>
<version>1.0</version>
</dependency>
I was able to get the buffered images from a tiff resource:
Resource img3 = new ClassPathResource(TIFF4);
ImageInputStream is = ImageIO.createImageInputStream(img3.getInputStream());
Iterator<ImageReader> iterator = ImageIO.getImageReaders(is);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: ");
}
// We are just looking for the first reader compatible:
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(is);
int nbPages = reader.getNumImages(true);
LOGGER.info("No. of pages for tiff file is {}", nbPages);
BufferedImage image1 = reader.read(0);
BufferedImage image2 = reader.read(1);
BufferedImage image3 = reader.read(2);
But then i found another project called apache commons imaging
Maven dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-imaging</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
In one line you can get the buffered images:
List<BufferedImage> bufferedImages = Imaging.getAllBufferedImages(img3.getInputStream(), TIFF4);
LOGGER.info("No. of pages for tiff file is {} using apache commons imaging", bufferedImages.size());
Then write to file sample:
final Map<String, Object> params = new HashMap<String, Object>();
// set optional parameters if you like
params.put(ImagingConstants.PARAM_KEY_COMPRESSION, new Integer(TiffConstants.TIFF_COMPRESSION_CCITT_GROUP_4));
int i = 0;
for (Iterator<BufferedImage> iterator1 = bufferedImages.iterator(); iterator1.hasNext(); i++) {
BufferedImage bufferedImage = iterator1.next();
LOGGER.info("Image type {}", bufferedImage.getType());
File outFile = new File("C:\\tmp" + File.separator + "shane" + i + ".tiff");
Imaging.writeImage(bufferedImage, outFile, ImageFormats.TIFF, params);
}
Actually testing performance, apache is alot slower...
Or use an old version of iText, which is alot faster:
private ByteArrayOutputStream convertTiffToPdf(InputStream imageStream) throws IOException, DocumentException {
Image image;
ByteArrayOutputStream out = new ByteArrayOutputStream();
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, out);
writer.setStrictImageSequence(true);
document.open();
RandomAccessFileOrArray ra = new RandomAccessFileOrArray(imageStream);
int pages = TiffImage.getNumberOfPages(ra);
for (int i = 1; i <= pages; i++) {
image = TiffImage.getTiffImage(ra, i);
image.setAbsolutePosition(0, 0);
image.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
document.setPageSize(PageSize.A4);
document.newPage();
document.add(image);
}
document.close();
out.flush();
return out;
}
This is how I did it with ImageIO:
public List<BufferedImage> extractImages(InputStream fileInput) throws Exception {
List<BufferedImage> extractedImages = new ArrayList<BufferedImage>();
try (ImageInputStream iis = ImageIO.createImageInputStream(fileInput)) {
ImageReader reader = getTiffImageReader();
reader.setInput(iis);
int pages = reader.getNumImages(true);
for (int imageIndex = 0; imageIndex < pages; imageIndex++) {
BufferedImage bufferedImage = reader.read(imageIndex);
extractedImages.add(bufferedImage);
}
}
return extractedImages;
}
private ImageReader getTiffImageReader() {
Iterator<ImageReader> imageReaders = ImageIO.getImageReadersByFormatName("TIFF");
if (!imageReaders.hasNext()) {
throw new UnsupportedOperationException("No TIFF Reader found!");
}
return imageReaders.next();
}
I took part of the code from this blog.
All the proposed solutions require reading the multipage image page by page and write the pages back to new TIFF images. Unless you want to save the individual pages to different image format, there is no point in decoding the image. Given the special structure of the TIFF image, you can split a multipage TIFF into single TIFF images without decoding.
The TIFF tweaking tool (part of a larger image related library - "icafe" I am using is written from scratch with pure Java. It can delete pages, insert pages, retain certain pages, split pages from a multiple page TIFF as well as merge multipage TIFF images into one TIFF image without decompressing them.
After trying with the TIFF tweaking tool, I am able to split the image into 3 pages: page#0, page#1, and page#2
NOTE1: The original demo image for some reason contains "incorrect" StripByteCounts value 1 which is not the actual bytes needed for the images strip. It turns out that the image data are not compressed, so the actual bytes for each image strip could be figured out through other TIFF field values such as RowsPerStrip, SamplesPerPixel, ImageWidth, etc.
NOTE2: Since in splitting the TIFF, the above mentioned library doesn't need to decode and re-encode the image. So it's fast and it also keeps the original encoding and additional metadata of each pages!
It works to set the compression to default param.setCompression(32946);.
public static void doitJAI(String mutitiff) throws IOException {
FileSeekableStream ss = new FileSeekableStream(mutitiff);
ImageDecoder dec = ImageCodec.createImageDecoder("tiff", ss, null);
int count = dec.getNumPages();
TIFFEncodeParam param = new TIFFEncodeParam();
param.setCompression(32946);
param.setLittleEndian(false); // Intel
System.out.println("This TIF has " + count + " image(s)");
for (int i = 0; i < count; i++) {
RenderedImage page = dec.decodeAsRenderedImage(i);
File f = new File("D:/PSN/SCB/SCAN/bin/Debug/Temps/test/single_" + i + ".tif");
System.out.println("Saving " + f.getCanonicalPath());
ParameterBlock pb = new ParameterBlock();
pb.addSource(page);
pb.add(f.toString());
pb.add("tiff");
pb.add(param);
RenderedOp r = JAI.create("filestore",pb);
r.dispose();
}
}
The below code will convert the multiple tiff into individual's and produces an Excel sheet with list of tiff images.
You need to create a folder in the C drive and place your TIFF images into it then run this code.
Need to import the below jars.
1.sun-as-jsr88-dm-4.0-sources
2./sun-jai_codec
3.sun-jai_core
import java.awt.AWTException;
import java.awt.Robot;
import java.awt.image.RenderedImage;
import java.awt.image.renderable.ParameterBlock;
import java.io.File;
import java.io.IOException;
import javax.media.jai.JAI;
import javax.media.jai.RenderedOp;
import com.sun.media.jai.codec.FileSeekableStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageDecoder;
import com.sun.media.jai.codec.TIFFEncodeParam;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import javax.swing.JOptionPane;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
public class TIFF_Sepreator {
File folder = new File("C:/FAX/");
public static void infoBox(String infoMessage, String titleBar)
{
JOptionPane.showMessageDialog(null, infoMessage, "InfoBox: " + titleBar, JOptionPane.INFORMATION_MESSAGE);
}
public void splitting() throws IOException, AWTException
{
boolean FinalFAXFolder = (new File("C:/Final_FAX")).mkdirs();
File[] listOfFiles = folder.listFiles();
String dateFormat = new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime());
try{
if (listOfFiles.length > 0)
{
for(int file=0; file<listOfFiles.length; file++)
{
System.out.println(listOfFiles[file]);
FileSeekableStream ss = new FileSeekableStream(listOfFiles[file]);
ImageDecoder dec = ImageCodec.createImageDecoder("tiff", ss, null);
int count = dec.getNumPages();
TIFFEncodeParam param = new TIFFEncodeParam();
param.setCompression(TIFFEncodeParam.COMPRESSION_GROUP4);
param.setLittleEndian(false); // Intel
System.out.println("This TIF has " + count + " image(s)");
for (int i = 0; i < count; i++)
{
RenderedImage page = dec.decodeAsRenderedImage(i);
File f = new File("C:\\Final_FAX\\"+dateFormat+ file +i + ".tif");
System.out.println("Saving " + f.getCanonicalPath());
ParameterBlock pb = new ParameterBlock();
pb.addSource(page);
pb.add(f.toString());
pb.add("tiff");
pb.add(param);
RenderedOp r = JAI.create("filestore",pb);
r.dispose();
}
}
TIFF_Sepreator.infoBox("Find your splitted TIFF images in location 'C:/Final_FAX/' " , "Done :)");
WriteListOFFilesIntoExcel();
}
else
{
TIFF_Sepreator.infoBox("No files was found in location 'C:/FAX/' " , "Empty folder");
System.out.println("No files found");
}
}
catch(Exception e)
{
TIFF_Sepreator.infoBox("Unabe to run due to this error: " +e , "Error");
System.out.println("Error: "+e);
}
}
public void WriteListOFFilesIntoExcel(){
File[] listOfFiles = folder.listFiles();
ArrayList<File> files = new ArrayList<File>(Arrays.asList(folder.listFiles()));
try {
String filename = "C:/Final_FAX/List_Of_Fax_Files.xls" ;
HSSFWorkbook workbook = new HSSFWorkbook();
HSSFSheet sheet = workbook.createSheet("FirstSheet");
for (int file=0; file<listOfFiles.length; file++) {
System.out.println(listOfFiles[file]);
Row r = sheet.createRow(file);
r.createCell(0).setCellValue(files.get(file).toString());
}
FileOutputStream fileOut = new FileOutputStream(filename);
workbook.write(fileOut);
fileOut.close();
System.out.println("Your excel file has been generated!");
}
catch(Exception ex){
TIFF_Sepreator.infoBox("Unabe to run due to this error: " +ex , "Error");
System.out.println("Error: "+ex);
}
}
public static void main(String[] args) throws IOException, AWTException {
new TIFF_Sepreator().splitting();
}
}
How to make Tess4J get image from PDF file?
I'm sarted on the transformation image file to text using OCR (Tess4J). It works fine, I have tested on image and it is great.
File imageFile = new File("D:\\HEAD2.png");
Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
try {
String result = instance.doOCR(imageFile);
System.out.println(result);
} catch (TesseractException e) {
System.err.println(e.getMessage());
}
But I'm facing this problem. I would parse a pdf file that contains image so. I don't kow how to do And I have not found any exemple Tess4J with pdf
I tested this example with Asprise, but I don't find any example like this on Tess4J
import com.asprise.util.pdf.PDFReader;
import com.asprise.util.ocr.OCR;
PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();
for(int i=0; i < pages; i++) {
BufferedImage img = reader.getPageAsImage(i);
// recognizes both characters and barcodes
String text = new OCR().recognizeAll(image);
System.out.println("Page " + i + ": " + text);
}
reader.close(); // finally, close the file.
make use of pdfutilities.convertpdf2png and use it like you did before with images.
Tess4j has a dependency on pdfbox, so you can use this library. It could be something like this:
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
PDDocument document = PDDocument.load(new File("YOUR_PDF_FILE_PATH"));
PDFRenderer pdfRenderer = new PDFRenderer(document);
ITesseract tesseract = new Tesseract();
tesseract.setDatapath("tessdata");
tesseract.setLanguage("spa");
for (int page = 0; page < document.getNumberOfPages(); page++) {
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
try {
String str = tesseract.doOCR(bufferedImage);
System.out.println(str);
} catch (TesseractException ex) {
Logger.getLogger(OCR.class.getName()).log(Level.SEVERE, null, ex);
}
}
document.close();
I'm using here Tessj4 4.5 and pdf-box 2.0.
You can also check
https://colwil.com/how-to-extract-text-from-a-scanned-pdf-using-ocr-in-java/.