Splitting a multipage TIFF image into individual images (Java) - java

Been tearing my hair out over this one.
How do I split a multipage / multilayer TIFF image into several individual images?
Demo image available here.
(Would prefer a pure Java (i.e. non-native) solution. Doesn't matter if the solution relies on commercial libraries.)

You can use the Java Advanced Imaging library, JAI, to split a multipage TIFF, by using an ImageReader:
ImageInputStream is = ImageIO.createImageInputStream(new File(pathToImage));
if (is == null || is.length() == 0){
// handle error
}
Iterator<ImageReader> iterator = ImageIO.getImageReaders(is);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: " + pathToImage);
}
// We are just looking for the first reader compatible:
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(is);
Then you can get the number of pages:
int nbPages = reader.getNumImages(true);
and read pages separately:
BufferedImage page = reader.read(numPage); // numPage is a zero-based page index
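Putting the fragments above together, a minimal sketch could look like the following. It assumes a TIFF reader and writer are registered with ImageIO (the JDK ships them since Java 9; older runtimes need a plugin such as JAI Image I/O). The class name TiffPageSplitter and the page_N.tif file naming are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

/** Hypothetical helper: splits a multipage TIFF into one file per page. */
class TiffPageSplitter {

    /**
     * Reads every page of the source file and writes page N to
     * outputDir/page_N.tif. Returns the files that were written.
     */
    static List<File> split(File source, File outputDir) throws IOException {
        List<File> written = new ArrayList<>();
        try (ImageInputStream input = ImageIO.createImageInputStream(source)) {
            if (input == null) {
                throw new IOException("Cannot open " + source);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(input);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for " + source);
            }
            // As in the answer, take the first compatible reader.
            ImageReader reader = readers.next();
            try {
                reader.setInput(input);
                int pageCount = reader.getNumImages(true);
                for (int page = 0; page < pageCount; page++) {
                    BufferedImage image = reader.read(page);
                    File target = new File(outputDir, "page_" + page + ".tif");
                    // ImageIO.write returns false when no writer handles the format.
                    if (!ImageIO.write(image, "tiff", target)) {
                        throw new IOException("No TIFF writer available");
                    }
                    written.add(target);
                }
            } finally {
                reader.dispose();
            }
        }
        return written;
    }
}
```

A call would look like TiffPageSplitter.split(new File(pathToImage), new File(outputDir)). Note that each page is decoded and re-encoded with the writer's default settings, so the output compression may differ from the source file.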

A fast but non-Java solution is tiffsplit. It is part of the libtiff library.
An example command to split a TIFF file into all its layers would be:
tiffsplit image.tif
The manpage says it all:
NAME
tiffsplit - split a multi-image TIFF into single-image TIFF files
SYNOPSIS
tiffsplit src.tif [ prefix ]
DESCRIPTION
tiffsplit takes a multi-directory (page) TIFF file and creates one or more single-directory (page) TIFF files
from it. The output files are given names created by concatenating a prefix, a lexically ordered suffix in the
range [aaa-zzz], the suffix .tif (e.g. xaaa.tif, xaab.tif, xzzz.tif). If a prefix is not specified on the
command line, the default prefix of x is used.
OPTIONS
None.
BUGS
Only a select set of ‘‘known tags’’ is copied when splitting.
SEE ALSO
tiffcp(1), tiffinfo(1), libtiff(3TIFF)
Libtiff library home page: http://www.remotesensing.org/libtiff/
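If you go the tiffsplit route and need to find the output files afterwards (for example from a Java wrapper around the command), the naming rule in the DESCRIPTION can be sketched like this. It assumes the first page maps to aaa and that the last letter varies fastest, as the man page examples suggest. TiffsplitNames and outputName are hypothetical names and not part of libtiff.

```java
/** Hypothetical helper that mirrors the naming scheme described in the man page. */
class TiffsplitNames {

    /**
     * Returns prefix + three-letter suffix (aaa..zzz) + ".tif" for the given
     * zero-based page index.
     */
    static String outputName(String prefix, int pageIndex) {
        if (pageIndex < 0 || pageIndex >= 26 * 26 * 26) {
            throw new IllegalArgumentException("page index out of range: " + pageIndex);
        }
        char first = (char) ('a' + pageIndex / (26 * 26));
        char second = (char) ('a' + (pageIndex / 26) % 26);
        char third = (char) ('a' + pageIndex % 26);
        return prefix + first + second + third + ".tif";
    }

    public static void main(String[] args) {
        // Print the names of the first three pages with the default prefix "x".
        for (int i = 0; i < 3; i++) {
            System.out.println(outputName("x", i));
        }
    }
}
```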

I used the sample above with a TIFF plugin I found called imageio-tiff.
Maven dependency:
<dependency>
<groupId>com.tomgibara.imageio</groupId>
<artifactId>imageio-tiff</artifactId>
<version>1.0</version>
</dependency>
I was able to get the buffered images from a tiff resource:
Resource img3 = new ClassPathResource(TIFF4);
ImageInputStream is = ImageIO.createImageInputStream(img3.getInputStream());
Iterator<ImageReader> iterator = ImageIO.getImageReaders(is);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: ");
}
// We are just looking for the first reader compatible:
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(is);
int nbPages = reader.getNumImages(true);
LOGGER.info("No. of pages for tiff file is {}", nbPages);
BufferedImage image1 = reader.read(0);
BufferedImage image2 = reader.read(1);
BufferedImage image3 = reader.read(2);
But then I found another project called Apache Commons Imaging.
Maven dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-imaging</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
In one line you can get the buffered images:
List<BufferedImage> bufferedImages = Imaging.getAllBufferedImages(img3.getInputStream(), TIFF4);
LOGGER.info("No. of pages for tiff file is {} using apache commons imaging", bufferedImages.size());
Then write to file sample:
final Map<String, Object> params = new HashMap<String, Object>();
// set optional parameters if you like
params.put(ImagingConstants.PARAM_KEY_COMPRESSION, Integer.valueOf(TiffConstants.TIFF_COMPRESSION_CCITT_GROUP_4));
int i = 0;
for (Iterator<BufferedImage> iterator1 = bufferedImages.iterator(); iterator1.hasNext(); i++) {
BufferedImage bufferedImage = iterator1.next();
LOGGER.info("Image type {}", bufferedImage.getType());
File outFile = new File("C:\\tmp" + File.separator + "shane" + i + ".tiff");
Imaging.writeImage(bufferedImage, outFile, ImageFormats.TIFF, params);
}
Actually, after testing performance, Apache is a lot slower...
Or use an old version of iText, which is a lot faster:
private ByteArrayOutputStream convertTiffToPdf(InputStream imageStream) throws IOException, DocumentException {
Image image;
ByteArrayOutputStream out = new ByteArrayOutputStream();
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, out);
writer.setStrictImageSequence(true);
document.open();
RandomAccessFileOrArray ra = new RandomAccessFileOrArray(imageStream);
int pages = TiffImage.getNumberOfPages(ra);
for (int i = 1; i <= pages; i++) {
image = TiffImage.getTiffImage(ra, i);
image.setAbsolutePosition(0, 0);
image.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
document.setPageSize(PageSize.A4);
document.newPage();
document.add(image);
}
document.close();
out.flush();
return out;
}

This is how I did it with ImageIO:
public List<BufferedImage> extractImages(InputStream fileInput) throws Exception {
List<BufferedImage> extractedImages = new ArrayList<BufferedImage>();
try (ImageInputStream iis = ImageIO.createImageInputStream(fileInput)) {
ImageReader reader = getTiffImageReader();
reader.setInput(iis);
int pages = reader.getNumImages(true);
for (int imageIndex = 0; imageIndex < pages; imageIndex++) {
BufferedImage bufferedImage = reader.read(imageIndex);
extractedImages.add(bufferedImage);
}
}
return extractedImages;
}
private ImageReader getTiffImageReader() {
Iterator<ImageReader> imageReaders = ImageIO.getImageReadersByFormatName("TIFF");
if (!imageReaders.hasNext()) {
throw new UnsupportedOperationException("No TIFF Reader found!");
}
return imageReaders.next();
}
I took part of the code from this blog.
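If getTiffImageReader() ends up throwing "No TIFF Reader found!", it can help to check which readers ImageIO has actually registered. A small diagnostic sketch follows (ReaderCheck is a hypothetical name; which formats show up depends on your JDK version and on the plugins on the classpath):

```java
import java.util.Iterator;
import java.util.Locale;
import java.util.TreeSet;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

/** Hypothetical diagnostic helper for ImageIO reader registration. */
class ReaderCheck {

    /** True when at least one registered ImageIO reader claims the format. */
    static boolean hasReader(String formatName) {
        Iterator<ImageReader> readers = ImageIO.getImageReadersByFormatName(formatName);
        return readers.hasNext();
    }

    /** Sorted, lower-cased, de-duplicated set of readable format names. */
    static TreeSet<String> readableFormats() {
        TreeSet<String> names = new TreeSet<>();
        for (String name : ImageIO.getReaderFormatNames()) {
            names.add(name.toLowerCase(Locale.ROOT));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Readable formats: " + readableFormats());
        System.out.println("TIFF reader present: " + hasReader("TIFF"));
    }
}
```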

All the proposed solutions require reading the multipage image page by page and writing the pages back to new TIFF images. Unless you want to save the individual pages in a different image format, there is no point in decoding the image. Given the special structure of a TIFF image, you can split a multipage TIFF into single-page TIFF images without decoding.
The TIFF tweaking tool I am using (part of a larger image-related library, "icafe") is written from scratch in pure Java. It can delete pages, insert pages, retain certain pages, split pages from a multipage TIFF, and merge multipage TIFF images into one TIFF image, all without decompressing them.
After trying the TIFF tweaking tool, I was able to split the image into 3 pages: page#0, page#1, and page#2.
NOTE1: For some reason, the original demo image contains an "incorrect" StripByteCounts value of 1, which is not the actual number of bytes needed for the image strip. It turns out that the image data are not compressed, so the actual byte count for each image strip can be worked out from other TIFF field values such as RowsPerStrip, SamplesPerPixel, ImageWidth, etc.
NOTE2: When splitting the TIFF, the above-mentioned library doesn't need to decode and re-encode the image, so it's fast, and it also keeps the original encoding and additional metadata of each page!
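To illustrate the "special structure" mentioned above: a classic TIFF stores each page as an image file directory (IFD), and the IFDs form a linked list that starts at the offset stored in the 8-byte file header. The sketch below is not icafe code. It is a hypothetical walker (TiffIfdWalker is a made-up name) that only counts the IFDs in the main chain without touching any pixel data, assuming a classic (non-BigTIFF) file and treating every IFD in that chain as one page.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: counts pages by walking the IFD chain of a classic TIFF. */
class TiffIfdWalker {

    static int countPages(byte[] tiff) {
        if (tiff.length < 8) {
            throw new IllegalArgumentException("Too short to hold a TIFF header");
        }
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        // Bytes 0-1 give the byte order: "II" is little-endian, "MM" is big-endian.
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("Missing II/MM byte-order mark");
        }
        // Bytes 2-3 hold the magic number 42 for classic TIFF.
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("Magic number is not 42 (BigTIFF is not handled)");
        }
        int pages = 0;
        // Bytes 4-7 hold the unsigned offset of the first IFD.
        long offset = buf.getInt(4) & 0xFFFFFFFFL;
        while (offset != 0) {
            if (offset + 2 > tiff.length) {
                throw new IllegalArgumentException("IFD offset outside the file");
            }
            // An IFD is: 2-byte entry count, 12 bytes per entry, 4-byte next-IFD offset.
            int entryCount = buf.getShort((int) offset) & 0xFFFF;
            long nextPointer = offset + 2 + 12L * entryCount;
            if (nextPointer + 4 > tiff.length) {
                throw new IllegalArgumentException("IFD runs past the end of the file");
            }
            pages++;
            // Every IFD needs at least 6 bytes, so more IFDs than that means a cycle.
            if (pages > tiff.length / 6) {
                throw new IllegalArgumentException("IFD chain appears to loop");
            }
            offset = buf.getInt((int) nextPointer) & 0xFFFFFFFFL;
        }
        return pages;
    }
}
```

A real splitter would additionally copy each IFD's tag values and strip or tile data into a new file and rewrite the offsets, which is where tags such as StripOffsets and StripByteCounts come in.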

It works if you set the compression to Deflate (TIFF compression code 32946) with param.setCompression(32946);.
public static void doitJAI(String mutitiff) throws IOException {
FileSeekableStream ss = new FileSeekableStream(mutitiff);
ImageDecoder dec = ImageCodec.createImageDecoder("tiff", ss, null);
int count = dec.getNumPages();
TIFFEncodeParam param = new TIFFEncodeParam();
param.setCompression(32946);
param.setLittleEndian(false); // false = big-endian (Motorola); Intel byte order would be true
System.out.println("This TIF has " + count + " image(s)");
for (int i = 0; i < count; i++) {
RenderedImage page = dec.decodeAsRenderedImage(i);
File f = new File("D:/PSN/SCB/SCAN/bin/Debug/Temps/test/single_" + i + ".tif");
System.out.println("Saving " + f.getCanonicalPath());
ParameterBlock pb = new ParameterBlock();
pb.addSource(page);
pb.add(f.toString());
pb.add("tiff");
pb.add(param);
RenderedOp r = JAI.create("filestore",pb);
r.dispose();
}
}
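For comparison, roughly the same write step can be attempted with plain ImageIO instead of the JAI codec classes. This is a sketch under assumptions: it relies on the registered TIFF writer offering a compression type named "Deflate" (the writer built into the JDK since Java 9 lists one), and DeflateTiffWriter is a made-up name.

```java
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Iterator;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

/** Hypothetical helper: writes one page as a Deflate-compressed TIFF via ImageIO. */
class DeflateTiffWriter {

    static void write(RenderedImage page, File target) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF writer registered with ImageIO");
        }
        ImageWriter writer = writers.next();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(target)) {
            if (out == null) {
                throw new IOException("Cannot open " + target + " for writing");
            }
            writer.setOutput(out);
            ImageWriteParam param = writer.getDefaultWriteParam();
            // Assumption: the writer exposes a compression type called "Deflate".
            String[] types = param.canWriteCompressed() ? param.getCompressionTypes() : null;
            if (types == null || !Arrays.asList(types).contains("Deflate")) {
                throw new IOException("This TIFF writer does not offer Deflate compression");
            }
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType("Deflate");
            writer.write(null, new IIOImage(page, null, null), param);
        } finally {
            writer.dispose();
        }
    }
}
```

Each RenderedImage returned by dec.decodeAsRenderedImage(i) could be passed to DeflateTiffWriter.write in place of the "filestore" operation, though I have not compared the resulting files byte for byte with the JAI output.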

The code below splits multipage TIFF files into individual images and produces an Excel sheet listing the TIFF images.
You need to create the folder C:/FAX/ and place your TIFF images in it, then run this code.
You need to import the jars below (the Excel export also uses Apache POI, see the org.apache.poi imports):
1. sun-as-jsr88-dm-4.0-sources
2. sun-jai_codec
3. sun-jai_core
import java.awt.AWTException;
import java.awt.Robot;
import java.awt.image.RenderedImage;
import java.awt.image.renderable.ParameterBlock;
import java.io.File;
import java.io.IOException;
import javax.media.jai.JAI;
import javax.media.jai.RenderedOp;
import com.sun.media.jai.codec.FileSeekableStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageDecoder;
import com.sun.media.jai.codec.TIFFEncodeParam;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import javax.swing.JOptionPane;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
public class TIFF_Sepreator {
File folder = new File("C:/FAX/");
public static void infoBox(String infoMessage, String titleBar)
{
JOptionPane.showMessageDialog(null, infoMessage, "InfoBox: " + titleBar, JOptionPane.INFORMATION_MESSAGE);
}
public void splitting() throws IOException, AWTException
{
boolean FinalFAXFolder = (new File("C:/Final_FAX")).mkdirs();
File[] listOfFiles = folder.listFiles();
String dateFormat = new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime());
try{
if (listOfFiles.length > 0)
{
for(int file=0; file<listOfFiles.length; file++)
{
System.out.println(listOfFiles[file]);
FileSeekableStream ss = new FileSeekableStream(listOfFiles[file]);
ImageDecoder dec = ImageCodec.createImageDecoder("tiff", ss, null);
int count = dec.getNumPages();
TIFFEncodeParam param = new TIFFEncodeParam();
param.setCompression(TIFFEncodeParam.COMPRESSION_GROUP4);
param.setLittleEndian(false); // false = big-endian (Motorola); Intel byte order would be true
System.out.println("This TIF has " + count + " image(s)");
for (int i = 0; i < count; i++)
{
RenderedImage page = dec.decodeAsRenderedImage(i);
File f = new File("C:\\Final_FAX\\" + dateFormat + "_" + file + "_" + i + ".tif"); // separators keep file/page indexes from colliding (1+12 vs 11+2)
System.out.println("Saving " + f.getCanonicalPath());
ParameterBlock pb = new ParameterBlock();
pb.addSource(page);
pb.add(f.toString());
pb.add("tiff");
pb.add(param);
RenderedOp r = JAI.create("filestore",pb);
r.dispose();
}
}
TIFF_Sepreator.infoBox("Find your split TIFF images in location 'C:/Final_FAX/' " , "Done :)");
WriteListOFFilesIntoExcel();
}
else
{
TIFF_Sepreator.infoBox("No files were found in location 'C:/FAX/' " , "Empty folder");
System.out.println("No files found");
}
}
catch(Exception e)
{
TIFF_Sepreator.infoBox("Unable to run due to this error: " +e , "Error");
System.out.println("Error: "+e);
}
}
public void WriteListOFFilesIntoExcel(){
File[] listOfFiles = folder.listFiles();
ArrayList<File> files = new ArrayList<File>(Arrays.asList(folder.listFiles()));
try {
String filename = "C:/Final_FAX/List_Of_Fax_Files.xls" ;
HSSFWorkbook workbook = new HSSFWorkbook();
HSSFSheet sheet = workbook.createSheet("FirstSheet");
for (int file=0; file<listOfFiles.length; file++) {
System.out.println(listOfFiles[file]);
Row r = sheet.createRow(file);
r.createCell(0).setCellValue(files.get(file).toString());
}
FileOutputStream fileOut = new FileOutputStream(filename);
workbook.write(fileOut);
fileOut.close();
System.out.println("Your excel file has been generated!");
}
catch(Exception ex){
TIFF_Sepreator.infoBox("Unable to run due to this error: " +ex , "Error");
System.out.println("Error: "+ex);
}
}
public static void main(String[] args) throws IOException, AWTException {
new TIFF_Sepreator().splitting();
}
}

Related

How to convert one pdf to multiple png images with multithreading

I used the following method to convert a pdf into multiple png images:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.imgscalr.Scalr;
public class ImgUtil {
public static List<String> convertPDFPagesToImages(String sourceFilePath, String desFilePath){
List<String> urlList = new ArrayList<>();
try {
File sourceFile = new File(sourceFilePath);
File destinationFile = new File(desFilePath);
if (!destinationFile.exists()) {
destinationFile.mkdir();
log.info("Folder Created ->:{} ", destinationFile.getAbsolutePath());
}
if (sourceFile.exists()) {
log.info("Images copied to Folder Location: {}", destinationFile.getAbsolutePath());
PDDocument document = PDDocument.load(sourceFile);
PDFRenderer pdfRenderer = new PDFRenderer(document);
int numberOfPages = document.getNumberOfPages();
log.info("Total files to be converting ->{} ", numberOfPages);
String fileName = sourceFile.getName().replace(".pdf", "");
String fileExtension = "png";
/*
* 600 dpi gives better image clarity, but each image is about three times the size it has at 300 dpi.
* Ex: 1. For 300dpi 04-Request-Headers_2.png expected size is 797 KB
* 2. For 600dpi 04-Request-Headers_2.png expected size is 2.42 MB
*/
int dpi = 300; // use a lower dpi to save disk space; for professional use you can go above 300 dpi
for (int i = 0; i < numberOfPages; ++i) {
File outPutFile = new File(desFilePath + fileName +"_"+ (i+1) +"."+ fileExtension);
BufferedImage bImage = pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB);
ImageIO.write(bImage, fileExtension, outPutFile);
urlList.add(outPutFile.getPath().replaceAll("\\\\", "/"));
}
document.close();
log.info("Converted Images are saved at ->{} ", destinationFile.getAbsolutePath());
} else {
log.error(sourceFile.getName() +" File not exists");
}
} catch (Exception e) {
e.printStackTrace();
}
return urlList;
}
public static void main(String[] args) {
convertPDFPagesToImages("D:\\tmp\\report\\pdfPath\\61199020100754118.pdf", "D:\\tmp\\report\\pdfPath\\");
}
}
But I found that when the PDF has a relatively large number of pages, the image conversion is slow. I am considering using multithreading to render the images. Is it possible to convert a PDF into images with multiple threads, or is there a similar method?
A simple way to speed up this conversion would be to move the image writing to a background thread. Set up an ExecutorService before opening the PDF:
ExecutorService exec = Executors.newFixedThreadPool(1);
List<Future<?>> pending = new ArrayList<>();
Instead of writing the image in the calling thread, just submit a new task to the service:
// ImageIO.write(bImage, fileExtension, outPutFile);
pending.add(exec.submit(() -> write(bImage, fileExtension, outPutFile)));
And a function to perform the task:
private static void write(BufferedImage image, String fileExtension, File file) {
try {
ImageIO.write(image, fileExtension, file);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
After closing the PDF document make sure the executor is finished:
for (Future<?> fut : pending) {
fut.get();
}
exec.shutdown();
exec.awaitTermination(365, TimeUnit.DAYS);
Using more than one thread for ImageIO.write may not benefit you, as it is a heavy I/O operation. However, experimenting with writing to a large ByteArrayOutputStream first and then to the file may also help on your specific hardware.
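The ByteArrayOutputStream idea could be tried along these lines. This is only a sketch: the list of BufferedImage objects stands in for pages coming out of pdfRenderer.renderImageWithDPI (in real code you would submit each page as soon as it is rendered), and BackgroundImageWriter, encodeAndStore and writeAll are hypothetical names. Whether this beats writing straight to the file is something to measure on your own hardware.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.imageio.ImageIO;

/** Hypothetical helper: encodes and stores images on a background thread. */
class BackgroundImageWriter {

    /** Encodes to memory first, then writes the bytes to disk in one call. */
    static Path encodeAndStore(BufferedImage image, String format, Path target) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream(1 << 20);
            if (!ImageIO.write(image, format, buffer)) {
                throw new IOException("No writer for format " + format);
            }
            return Files.write(target, buffer.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Submits one task per page to a single background thread and waits for all. */
    static List<Path> writeAll(List<BufferedImage> pages, Path dir)
            throws InterruptedException, ExecutionException {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < pages.size(); i++) {
                BufferedImage page = pages.get(i);
                Path target = dir.resolve("page_" + (i + 1) + ".png");
                Callable<Path> task = () -> encodeAndStore(page, "png", target);
                pending.add(exec.submit(task));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> future : pending) {
                written.add(future.get()); // rethrows any failure from the task
            }
            return written;
        } finally {
            exec.shutdown();
        }
    }
}
```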

Read multi page Tiff image and write to a pdf in Java

I'm trying to convert a multipage TIFF to a PDF using PDFBox and have not been successful. I'm not able to use Apache Commons Imaging at my company, as it is not a stable release.
Problem: I am not able to read a multipage TIFF and write it to a PDF.
Working so far: only the first page gets written and saved to the PDF. A single-page TIFF works.
Below is the code:
PDDocument doc = new PDDocument();
log.info("Read Image");
log.info("Process Image parts");
//Get the number of pages
int pages = 0;
try(ImageInputStream imageInputStream = ImageIO.createImageInputStream(new File("src/main/resources/output/testpdf.tiff"))) {
if (imageInputStream != null && imageInputStream.length() != 0) {
Iterator<ImageReader> iteratorIO = ImageIO.getImageReaders(imageInputStream);
if (iteratorIO != null && iteratorIO.hasNext()) {
ImageReader reader = iteratorIO.next();
reader.setInput(imageInputStream);
pages = reader.getNumImages(true);
log.info("Number of pages in the tiff is " + pages);
}
}
}
//Need a reader here for different page ?
for (int i=0; i<pages; i++) {
BufferedImage bimage = ImageIO.read(file);
PDPage page = new PDPage();
doc.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(doc, page);
try {
// the .08F can be tweaked. Go up for better quality,
// but the size of the PDF will increase
PDImageXObject image = JPEGFactory.createFromImage(doc, bimage, 0.08f);
Dimension scaledDim = getScaledDimension(new Dimension(image.getWidth(), image.getHeight()),
new Dimension((int) page.getMediaBox().getWidth(), (int) page.getMediaBox().getHeight()));
contentStream.drawImage(image, 1, 1, scaledDim.width, scaledDim.height);
} finally {
contentStream.close();
}
}
doc.save("src/main/resources/output/testpdf.pdf");
doc.close();
Do I need to come up with a reader which is not provided by ImageIO?
OR
Do I need to split the multipage TIFF into individual pages and then write them to a PDF?
I haven't worked much with image manipulation, but I appreciate the level of quality ImageIO gives after the conversion process!
Thanks
Try this. You need the PDFBox jar and the sun.jai.codec jar:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import com.sun.media.jai.codec.FileSeekableStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageDecoder;
import com.sun.media.jai.codec.SeekableStream;
import com.sun.media.jai.codec.TIFFDecodeParam;
public class FinalTtoP {
public static void main(String args[]) throws IOException
{
PDDocument document=new PDDocument();
File file = new File("C:/nn.tif"); //Enter Tiff file path
ImageInputStream isb = ImageIO.createImageInputStream(file);
Iterator<ImageReader> iterator = ImageIO.getImageReaders(isb);
if (iterator == null || !iterator.hasNext())
{
throw new IOException("Image file format not supported by ImageIO: ");
}
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(isb);
int nbPages = reader.getNumImages(true);
System.out.println(nbPages);
for(int p=0;p<nbPages;p++)
{
BufferedImage bufferedImage = reader.read(p);
PDPage page = new PDPage();
document.addPage(page);
PDImageXObject i = LosslessFactory.createFromImage(document, bufferedImage);
PDPageContentStream content =new PDPageContentStream(document, page);
content.drawImage(i, 0,0 ,page.getMediaBox().getWidth(),page.getMediaBox().getHeight());
content.close();
}
document.save("C:/nnnnm.pdf"); //Enter path to save your file with .pdf extension
document.close();
}
}
Refer to this code; it will improve speed. If it is still slow, then you need to use iText.
public static byte[] convertTiffToPdf(File tiffFile) throws IOException {
ByteArrayOutputStream outStream = null;
PDDocument document = null;
ImageInputStream imgInputStream = null;
try {
outStream = new ByteArrayOutputStream();
document = new PDDocument();
PDRectangle pageSize = PDRectangle.LETTER;
int noOfPages = 0;
imgInputStream = ImageIO.createImageInputStream(tiffFile);
Iterator<ImageReader> iterator = ImageIO.getImageReaders(imgInputStream);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: ");
}
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(imgInputStream);
noOfPages = reader.getNumImages(true);
for (int i = 0; i < noOfPages; i++) {
PDPageContentStream content = null;
try {
BufferedImage bufferedImage = reader.read(i);
PDPage page = new PDPage(pageSize);
document.addPage(page);
// PDImageXObject imgObject = LosslessFactory.createFromImage(document, bufferedImage); //Commented for PR 1028
PDImageXObject imgObject = CCITTFactory.createFromFile(document, tiffFile, i); //PR 1028
//PDImageXObject imgObject = JPEGFactory.createFromImage(document, bufferedImage);
content = new PDPageContentStream(document, page);
content.drawImage(imgObject, 0, 0, pageSize.getWidth(), pageSize.getHeight());
} catch(Exception e) {
e.printStackTrace();
} finally {
if (content != null) { // content stays null if reading the page failed
content.close();
}
}
}
document.save(outStream);
byte[] fileBytes = outStream.toByteArray();
return fileBytes;
} finally {
if (document != null) {
document.close();
}
if (imgInputStream != null) {
imgInputStream.close();
}
if (outStream != null) {
outStream.close();
}
}
}
You can use CCITTFactory.createFromFile(PDDocument document, File file, int number) which works for most bitonal tiff files. If that one doesn't work (because the TIFF file is tiled or in color), then read the individual pages into BufferedImage objects (see here) and then use LosslessFactory.createFromImage(PDDocument document, BufferedImage image) with the result.

Text is missing when converting pdf file into image in java using pdfbox

I want to convert a PDF page to an image file. Text is missing when I convert a PDF page to an image using Java.
The file I want to convert is 46_2.pdf; after conversion it looks like 46_2.png.
Code:
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
public class ConvertPDFPageToImageWithoutText {
public static void main(String[] args) {
try {
String oldPath = "C:/PDFCopy/46_2.pdf";
File oldFile = new File(oldPath);
if (oldFile.exists()) {
PDDocument document = PDDocument.load(oldPath);
List<PDPage> list = document.getDocumentCatalog().getAllPages();
for (PDPage page : list) {
BufferedImage image = page.convertToImage();
File outputfile = new File("C:/PDFCopy/image.png");
ImageIO.write(image, "png", outputfile);
}
document.close(); // close once, after all pages have been processed
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
Since you're using PDFBox, try using PDFImageWriter.writeToImage instead of PDPage.convertToImage. This post seems relevant to what you are trying to do.
I had the same problem. I found an article (unfortunately I can't remember where, because I've read hundreds of them) in which the author complained that such problems appeared in PDFBox after they updated Java to version 7.21. So I'm using 7.17 and it works for me :)
Use the latest version of PDFBox (I am using 2.0.9) and add the JAI Image I/O dependency from here. This is sample code running on Java 7.
public void pdfToImageConvertorUsingPdfBox(String inputPdfPath) throws Exception {
File sourceFile = new File(inputPdfPath);
String formatName = "png";
if (sourceFile.exists()) {
PDDocument document = PDDocument.load(sourceFile);
PDFRenderer pdfRenderer = new PDFRenderer(document);
int count = document.getNumberOfPages();
for (int i = 0; i < count; i++) {
BufferedImage image = pdfRenderer.renderImageWithDPI(i, 200, ImageType.RGB);
String output = FilenameUtils.removeExtension(inputPdfPath) + "_" + (i + 1) + "." + formatName;
ImageIO.write(image, formatName, new File(output));
}
document.close();
} else {
logger.error(sourceFile.getName() + " File not exists");
}
}

Get Image from the document using Apache POI

I am using Apache Poi to read images from docx.
Here is my code:
public Image ReadImg(int imageid) throws IOException {
XWPFDocument doc = new XWPFDocument(new FileInputStream("import.docx"));
BufferedImage jpg = null;
List<XWPFPictureData> pic = doc.getAllPictures();
XWPFPictureData pict = pic.get(imageid);
String extract = pict.suggestFileExtension();
byte[] data = pict.getData();
//try to read image data using javax.imageio.* (JDK 1.4+)
jpg = ImageIO.read(new ByteArrayInputStream(data));
return jpg;
}
It reads the images properly, but not in order.
For example, if the document contains
image1.jpeg
image2.jpeg
image3.jpeg
image4.jpeg
image5.jpeg
It reads
image4
image3
image1
image5
image2
Could you please help me to resolve it?
I want to read the images in order.
Thanks,
Sithik
public static void extractImages(XWPFDocument docx) {
try {
List<XWPFPictureData> piclist = docx.getAllPictures();
// traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator = piclist.iterator();
int i = 0;
while (iterator.hasNext()) {
XWPFPictureData pic = iterator.next();
byte[] bytepic = pic.getData();
BufferedImage imag = ImageIO.read(new ByteArrayInputStream(bytepic));
ImageIO.write(imag, "jpg", new File("D:/imagefromword/" + pic.getFileName()));
i++;
}
} catch (Exception e) {
System.exit(-1);
}
}

Make Tess4J get image from PDF file

How to make Tess4J get image from PDF file?
I've started on transforming an image file to text using OCR (Tess4J). It works fine; I have tested it on an image and it is great.
File imageFile = new File("D:\\HEAD2.png");
Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
try {
String result = instance.doOCR(imageFile);
System.out.println(result);
} catch (TesseractException e) {
System.err.println(e.getMessage());
}
But I'm facing this problem: I would like to parse a PDF file that contains images, and I don't know how to do it. I have not found any example of Tess4J with PDF.
I tested this example with Asprise, but I can't find any example like this for Tess4J:
import com.asprise.util.pdf.PDFReader;
import com.asprise.util.ocr.OCR;
PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();
for(int i=0; i < pages; i++) {
BufferedImage img = reader.getPageAsImage(i);
// recognizes both characters and barcodes
String text = new OCR().recognizeAll(img);
System.out.println("Page " + i + ": " + text);
}
reader.close(); // finally, close the file.
Make use of PdfUtilities.convertPdf2Png and then use the result like you did before with images.
Tess4j has a dependency on pdfbox, so you can use this library. It could be something like this:
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
PDDocument document = PDDocument.load(new File("YOUR_PDF_FILE_PATH"));
PDFRenderer pdfRenderer = new PDFRenderer(document);
ITesseract tesseract = new Tesseract();
tesseract.setDatapath("tessdata");
tesseract.setLanguage("spa");
for (int page = 0; page < document.getNumberOfPages(); page++) {
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);
try {
String str = tesseract.doOCR(bufferedImage);
System.out.println(str);
} catch (TesseractException ex) {
Logger.getLogger(OCR.class.getName()).log(Level.SEVERE, null, ex);
}
}
document.close();
I'm using Tess4J 4.5 and PDFBox 2.0 here.
You can also check
https://colwil.com/how-to-extract-text-from-a-scanned-pdf-using-ocr-in-java/.
