extract image from image - java

Is it possible to extract an image from a jpeg, png or tiff file? NOT PDF! Suppose I have a file containing both text and images in jpeg format (so it's basically a picture); I want to be able to extract the image only programmatically (preferably using Java). If anyone knows useful libraries please let me know. I have already tried AspriseOCR and tesseract-ocr, they have been successful at extracting text only (obviously).
Thank you.

Try:
int startPointX = xxx;
int startPointY = xxx;
int endPointX = xxx;
int endPointY = xxx;
BufferedImage image = ImageIO.read(new File("D:/temp/test.jpg"));
// getSubimage takes (x, y, width, height), so convert the end point into a size
BufferedImage out = image.getSubimage(startPointX, startPointY, endPointX - startPointX, endPointY - startPointY);
ImageIO.write(out, "jpg", new File("D:/temp/result.jpg"));
These points define the region of the image you want to extract.
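As a minimal sketch of the cropping step, the helper below turns two corner points into the (x, y, width, height) form that BufferedImage.getSubimage expects and clamps them to the image bounds. The class name RegionCrop and the method crop are made up for illustration, and the test image is created in memory rather than read from a file.

```java
import java.awt.image.BufferedImage;

public class RegionCrop {
    /**
     * Crops the rectangle spanned by two corner points (hypothetical helper).
     * The corners are clamped to the image bounds before cropping.
     */
    public static BufferedImage crop(BufferedImage src, int x1, int y1, int x2, int y2) {
        int left = Math.max(0, Math.min(x1, x2));
        int top = Math.max(0, Math.min(y1, y2));
        int right = Math.min(src.getWidth(), Math.max(x1, x2));
        int bottom = Math.min(src.getHeight(), Math.max(y1, y2));
        if (right <= left || bottom <= top) {
            throw new IllegalArgumentException("Region lies outside the image");
        }
        // getSubimage expects (x, y, width, height), not two corner points.
        return src.getSubimage(left, top, right - left, bottom - top);
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(200, 100, BufferedImage.TYPE_INT_RGB);
        // The right edge (250) is outside the image and gets clamped to 200.
        BufferedImage region = crop(page, 50, 20, 250, 80);
        System.out.println(region.getWidth() + "x" + region.getHeight()); // prints 150x60
    }
}
```

Note that getSubimage returns a view that shares pixel data with the source image, so changes to one are visible in the other.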
Extract image from pdf file
I suggest changing your post title. You can use the PDFBox or iText API. The example below extracts all of the images from a PDF file.
This might be a useful resource for you. If the PDF contains a lot of images, a java.lang.OutOfMemoryError may occur.
Download pdfbox.xx.jar here.
// Note: this example uses the PDFBox 1.x API (getAllPages, PDXObjectImage, write2file)
import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
public class ExtractImagesFromPDF {
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("D:/temp/test.pdf"));
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while(iter.hasNext()) {
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
Map images = resources.getImages();
if( images != null ) {
Iterator imageIter = images.keySet().iterator();
while(imageIter.hasNext()) {
String key = (String)imageIter.next();
System.out.println("Key : " + key);
PDXObjectImage image = (PDXObjectImage)images.get(key);
File file = new File("D:/temp/" + key + "." + image.getSuffix());
image.write2file(file);
}
}
}
// Release the resources held by the document once all pages are processed
document.close();
}
}
Extract specific image from pdf file
To extract a specific image, you have to know the index of the page and the index of the image on that page; otherwise, you cannot extract it.
The following example extracts the first image of the first page.
int targetPage = 0;
PDPage firstPage = (PDPage)document.getDocumentCatalog().getAllPages().get(targetPage);
PDResources resources = firstPage.getResources();
Map images = resources.getImages();
int targetImage = 0;
String imageKey = "Im" + targetImage;
PDXObjectImage image = (PDXObjectImage)images.get(imageKey);
File file = new File("D:/temp/" + imageKey + "." + image.getSuffix());
image.write2file(file);

If you are interested in an out-of-box product that could do this via black-box processing with minimal non-programming configuration (since you tried other products), then ABBYY FlexiCapture can do it. It can be configured to look for dynamic sizes of pictures/objects in loosely defined areas, or anywhere on the page, with full control over search logic. I used it once to extract lines of specific shape and thickness to separate chapters of a book, where each line indicated a new chapter, and could be anywhere on the page.


Apache PDFBox - vertical match between image and text position

I need help to achieve a mapping between text and image objects in a PDF document.
As the first figure shows, my PDF documents have 3 images arranged randomly in the y-direction. To the left of them are texts. The texts extend along the height of the images.
My goal is to combine the texts into "ImObj" objects (see the class ImObj).
The 2nd figure shows that I want to use the height of the image to detect the position of the texts (all texts outside of the image height should be ignored). In the example, there will be 3 ImObj-objects formed by the 3 images.
The link to the PDF file is here (on WeTransfer).
But my mapping does not work, because I probably use the wrong coordinates from the image. Now I have already looked at some examples, but I still don't really understand how to get the coordinates of text and images working together?
Here is my code:
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;
public class ImExample extends PDFTextStripper {
public static void main(String[] args) {
File file = new File("C://example document.pdf");
try {
PDDocument document = PDDocument.load(file);
ImExample example = new ImExample();
for (int pnr = 0; pnr < document.getPages().getCount(); pnr++) {
PDPage page = document.getPages().get(pnr);
PDResources res = page.getResources();
example.processPage(page);
int idx = 0;
for (COSName objName : res.getXObjectNames()) {
PDXObject xObj = res.getXObject(objName);
if (xObj instanceof PDImageXObject) {
System.out.println("...add a new image");
PDImageXObject imXObj = (PDImageXObject) xObj;
BufferedImage image = imXObj.getImage();
// Here is my mistake ... but I do not know how to solve it.
ImObj imObj = new ImObj(image, idx++, pnr, image.getMinY(), image.getMinY() + image.getHeight());
example.imObjects.add(imObj);
}
}
}
example.setSortByPosition(true);
example.getText(document);
// Output
for (ImObj iObj : example.imObjects)
System.out.println(iObj.idx + " -> " + iObj.text);
document.close();
} catch (Exception e) {
e.printStackTrace();
}
}
public List<ImObj> imObjects = new ArrayList<ImObj>();
public ImExample() throws IOException {
super();
}
@Override
protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
// match between imagesize and textposition
TextPosition txtPos = textPositions.get(0);
for (ImObj im : imObjects) {
if(im.page == (this.getCurrentPageNo()-1))
if (im.minY < txtPos.getY() && (txtPos.getY() + txtPos.getHeight()) < im.maxY)
im.text.append(text + " ");
}
}
}
class ImObj {
float minY, maxY;
Image image = null;
StringBuilder text = new StringBuilder("");
int idx, page = 0;
public ImObj(Image im, int idx, int pnr, float yMin, float yMax) {
this.idx = idx;
this.image = im;
this.minY = yMin;
this.maxY = yMax;
this.page = pnr;
}
}
Best regards
You're looking for the images in the (somewhat) wrong place!
You iterate over the image XObject resources of the page itself and inspect them. But this is not helpful:
An image XObject resource merely is that, a resource. I.e. it can be used on the page, even more than once, but you cannot determine from this resource alone how it is used (where? at which scale? transformed somehow?)
There are other places an image can be stored and used on a page, e.g. in the resources of some form XObject or pattern used on the page, or inline in the content stream.
What you actually need is to parse the page content stream for uses of images and the current transformation matrix at the time of use. For a basic implementation of this have a look at the PDFBox example PrintImageLocations.
The next problem you'll run into is that the coordinates PDFBox returns from the TextPosition methods getX and getY are not in the original coordinate system of the PDF page in question but in a coordinate system normalized for easier handling in the text extraction code. Thus, you most likely should use the un-normalized coordinates.
You can find information on that in this answer.
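To illustrate the coordinate mismatch, here is a small library-free sketch. It assumes the image position and height were taken from the current transformation matrix as in PrintImageLocations (Matrix.getTranslateY() and getScalingFactorY()), that the page is not rotated, and that its crop box starts at y = 0. It also assumes the text y value is measured from the top of the page, which is how the normalized TextPosition.getY() behaves as far as I understand it. The class VerticalBand and its methods are hypothetical names.

```java
public class VerticalBand {
    // Both edges are measured downwards from the top of the page.
    public final float top;
    public final float bottom;

    public VerticalBand(float top, float bottom) {
        this.top = top;
        this.bottom = bottom;
    }

    /**
     * Converts an image placement given in PDF user space (origin at the
     * bottom-left, y growing upwards) into a top-down band.
     * Assumes an unrotated page whose crop box starts at y = 0.
     */
    public static VerticalBand fromImagePlacement(float pageHeight, float imageLowerY, float imageHeight) {
        float top = pageHeight - (imageLowerY + imageHeight);
        float bottom = pageHeight - imageLowerY;
        return new VerticalBand(top, bottom);
    }

    /** True if a text baseline (measured from the top of the page) lies inside the band. */
    public boolean contains(float textBaselineY) {
        return top <= textBaselineY && textBaselineY <= bottom;
    }

    public static void main(String[] args) {
        // A4 page height in points, image with lower edge at y = 600 and height 100.
        VerticalBand band = fromImagePlacement(842f, 600f, 100f);
        System.out.println(band.top + " .. " + band.bottom); // prints 142.0 .. 242.0
        System.out.println(band.contains(200f));             // prints true
        System.out.println(band.contains(300f));             // prints false
    }
}
```

With the bands built this way, a writeString override could test each text baseline against the bands of the current page instead of against the pixel dimensions of the BufferedImage.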

PDFBox does not correctly render Simsun (chinese) font

Context
I am writing Java code that fills PDF forms using PDFBox with some user inputs.
Some of the inputs are in Chinese.
When I generate the PDF, I don't get any errors in the logs, but the rendered text does not match the input at all.
What I currently have
Here is what I do:
In the PDF file, I specified the SimSun font for the field using Adobe Pro.
This font handles Simplified Chinese characters.
I have the font SimSun installed on my server.
PDFBox doesn't display any error (if I remove the SimSun font from my server, PDFBox falls back on another font that is not able to render the characters). So I guess it is able to find the font and use it.
What I tried
I was able to make this work but I had to manually load the font in the code and add it to the PDF (see examples below).
But that is not a solution, as it means that I would have to load the font every time and add it to the PDF. I would also have to do the same for many other languages.
As far as I understood, PDFBox should be able to use any fonts installed on the server.
Below is a test class that tries 3 different approaches. Only the last one works so far:
Classic generation
Simply put Chinese characters inside the text field without changing anything.
The characters are not rendered correctly (some of them are missing and the ones displayed do not match the input).
Generation with embedded font
Try to embed the SimSun font inside the PDF with the PDResources.add(font) method.
The result is the same as the first method.
Embed the font and use it
I embed the SimSun font and I also override the font used in the TextField to use the SimSun font I just added.
This approach works.
After quite a few readings, I found out that the issue might come from the version of the font I am using.
Windows 8 (which I use to create the form) uses v5.04 of Simsun font.
I use v2.10 on my laptop and my servers, both being Linux based (I can not find the v5.04).
However, I don't know:
If the issue is really coming from this.
If I have the right to use this font, as it is developed by Microsoft (and Apple).
Where to find the latest version of it.
I tried using another font but:
I only find OTF fonts (and not TTF) that support Chinese characters.
PDFBox does not support OTF (yet). It is planned for v3.0.0.
So if someone has an idea on how to make this work without having to embed and change the font's name in the code, that would be great!
Here are the PDF I used and the code that tests the 3 methods I talked about.
The TextField in the pdf is named comment.
package org.test;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* Hello world!
*/
public class App {
private static final String SIMPLIFIED_CHINESE_STRING = "我不明白为什么它不起作用。";
public static void main(String[] args) throws IOException {
System.out.println("Hello World!");
// Test 1
classicGeneration();
// Test 2
generationWithEmbededFont();
// Test 3
generationWithFontOverride();
System.out.println("Bye!");
}
/**
* Classic PDF generation without any changes to the PDF.
*/
private static void classicGeneration() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDField commentField = acroForm.getField("comment");
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-classic-generation.pdf"));
}
/**
* Trying to embed the font in the PDF. It doesn't seem to work.
* The result is the same as classicGeneration method.
*/
private static void generationWithEmbededFont() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDFont font = PDType0Font.load(document, new File("/usr/share/fonts/SimSun.ttf"));
PDResources res = acroForm.getDefaultResources();
if (res == null) {
res = new PDResources();
}
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
PDField commentField = acroForm.getField("comment");
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-with-embeded-font.pdf"));
}
/**
* Embed the font in the PDF and change the font used in the TextField to use this one.
* Here the PDF is correctly rendered and all the characters are displayed.
* @throws IOException
*/
private static void generationWithFontOverride() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDField commentField = acroForm.getField("comment");
// Load the font
InputStream resourceAsStream = Thread.currentThread().getContextClassLoader().getResourceAsStream("SimSun.ttf");
PDFont font = PDType0Font.load(document, resourceAsStream);
PDResources res = acroForm.getDefaultResources();
if (res == null) {
res = new PDResources();
}
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
// Change the font used by the TextField
COSDictionary dict = commentField.getCOSObject();
COSString defaultAppearance = (COSString) dict.getDictionaryObject(COSName.DA);
if (defaultAppearance != null) {
String currentFont = dict.getString(COSName.DA);
// Retrieve the current font size and color used for the field in order to use the same but with the new font.
String regex = "[\\w]* ([\\w\\s]*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(currentFont);
// Default font size if we fail to extract the current one
String fontSize = " 11 Tf";
if (matcher.find()) {
fontSize = " " + matcher.group(1);
}
// Change the font of the TextField.
dict.setString(COSName.DA, "/" + fontName.getName() + fontSize);
}
commentField.getCOSObject().addAll(dict);
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-with-font-override.pdf"));
}
// HELPER
private static PDDocument loadPdf() throws IOException {
InputStream stream = Thread.currentThread().getContextClassLoader().getResourceAsStream("sample.pdf");
return PDDocument.load(stream);
}
}
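The font override in the third test depends on rewriting the field's default appearance (DA) string, which has the form "/FontName size Tf" plus optional color operators. The regex in the code above assumes the font name comes first in that string. As a hedged alternative sketch, the helper below replaces only the font name token wherever the Tf operator appears and leaves the size and color untouched. The class DefaultAppearance and the method withFont are hypothetical names, not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultAppearance {
    // Matches "/FontName <size> Tf" anywhere in a DA string.
    private static final Pattern FONT_OP = Pattern.compile("/(\\S+)(\\s+[0-9.]+\\s+Tf)");

    /**
     * Returns the DA string with its font name replaced by newFontName.
     * If no font operator is present, a default of 11 pt is prepended
     * (the same fallback size the code above uses).
     */
    public static String withFont(String da, String newFontName) {
        Matcher m = FONT_OP.matcher(da);
        if (!m.find()) {
            return ("/" + newFontName + " 11 Tf " + da).trim();
        }
        // Keep group 2 (size and Tf operator) and swap only the name.
        return m.replaceFirst("/" + Matcher.quoteReplacement(newFontName) + "$2");
    }

    public static void main(String[] args) {
        System.out.println(withFont("/Helv 11 Tf 0 g", "F5"));   // prints /F5 11 Tf 0 g
        System.out.println(withFont("0 g /SimSun 0 Tf", "F5"));  // prints 0 g /F5 0 Tf
    }
}
```

The returned string would then be stored with dict.setString(COSName.DA, ...) as in generationWithFontOverride, using the name that res.add(font) returned.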

Splitting a multipage TIFF image into individual images (Java)

Been tearing my hair on this one.
How do I split a multipage / multilayer TIFF image into several individual images?
Demo image available here.
(Would prefer a pure Java (i.e. non-native) solution. Doesn't matter if the solution relies on commercial libraries.)
You can use the Java Advanced Imaging library, JAI, to split a multipage TIFF by using an ImageReader:
ImageInputStream is = ImageIO.createImageInputStream(new File(pathToImage));
if (is == null || is.length() == 0){
// handle error
}
Iterator<ImageReader> iterator = ImageIO.getImageReaders(is);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: " + pathToImage);
}
// We are just looking for the first reader compatible:
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(is);
Then you can get the number of pages:
int nbPages = reader.getNumImages(true);
and read pages separately:
reader.read(numPage)
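Putting the steps above together, here is a minimal sketch that reads every page and writes each one to its own file. It assumes a TIFF reader and writer are registered with ImageIO, which is the case out of the box on Java 9 or later and otherwise requires a plugin such as JAI Image I/O. The class name TiffSplitter and the output naming scheme page_N.tif are made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class TiffSplitter {
    /** Writes every page of a multipage TIFF to outputDir and returns the new files. */
    public static List<File> split(File source, File outputDir) throws IOException {
        List<File> pages = new ArrayList<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(source)) {
            if (in == null) {
                throw new IOException("Cannot open " + source);
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("Image file format not supported by ImageIO: " + source);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                int count = reader.getNumImages(true);
                for (int i = 0; i < count; i++) {
                    // Each page is decoded and then re-encoded as a single-page TIFF.
                    BufferedImage page = reader.read(i);
                    File out = new File(outputDir, "page_" + i + ".tif");
                    if (!ImageIO.write(page, "tiff", out)) {
                        throw new IOException("No TIFF writer available");
                    }
                    pages.add(out);
                }
            } finally {
                reader.dispose();
            }
        }
        return pages;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TiffSplitter input.tif outputDirectory
        for (File f : split(new File(args[0]), new File(args[1]))) {
            System.out.println("Saved " + f);
        }
    }
}
```

Because each page is decoded and re-encoded, the original compression and any extra metadata are not guaranteed to survive.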
A fast but non-Java solution is tiffsplit. It is part of the libtiff library.
An example command to split a TIFF file into all its layers would be:
tiffsplit image.tif
The manpage says it all:
NAME
tiffsplit - split a multi-image TIFF into single-image TIFF files
SYNOPSIS
tiffsplit src.tif [ prefix ]
DESCRIPTION
tiffsplit takes a multi-directory (page) TIFF file and creates one or more single-directory (page) TIFF files
from it. The output files are given names created by concatenating a prefix, a lexically ordered suffix in the
range [aaa-zzz], the suffix .tif (e.g. xaaa.tif, xaab.tif, xzzz.tif). If a prefix is not specified on the
command line, the default prefix of x is used.
OPTIONS
None.
BUGS
Only a select set of "known tags" is copied when splitting.
SEE ALSO
tiffcp(1), tiffinfo(1), libtiff(3TIFF)
Libtiff library home page: http://www.remotesensing.org/libtiff/
I used the sample above with a TIFF plugin I found called imageio-tiff.
Maven dependency:
<dependency>
<groupId>com.tomgibara.imageio</groupId>
<artifactId>imageio-tiff</artifactId>
<version>1.0</version>
</dependency>
I was able to get the buffered images from a tiff resource:
Resource img3 = new ClassPathResource(TIFF4);
ImageInputStream is = ImageIO.createImageInputStream(img3.getInputStream());
Iterator<ImageReader> iterator = ImageIO.getImageReaders(is);
if (iterator == null || !iterator.hasNext()) {
throw new IOException("Image file format not supported by ImageIO: ");
}
// We are just looking for the first reader compatible:
ImageReader reader = (ImageReader) iterator.next();
iterator = null;
reader.setInput(is);
int nbPages = reader.getNumImages(true);
LOGGER.info("No. of pages for tiff file is {}", nbPages);
BufferedImage image1 = reader.read(0);
BufferedImage image2 = reader.read(1);
BufferedImage image3 = reader.read(2);
But then I found another project called Apache Commons Imaging.
Maven dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-imaging</artifactId>
<version>1.0-SNAPSHOT</version>
</dependency>
In one line you can get the buffered images:
List<BufferedImage> bufferedImages = Imaging.getAllBufferedImages(img3.getInputStream(), TIFF4);
LOGGER.info("No. of pages for tiff file is {} using apache commons imaging", bufferedImages.size());
Then write to file sample:
final Map<String, Object> params = new HashMap<String, Object>();
// set optional parameters if you like
params.put(ImagingConstants.PARAM_KEY_COMPRESSION, Integer.valueOf(TiffConstants.TIFF_COMPRESSION_CCITT_GROUP_4));
int i = 0;
for (Iterator<BufferedImage> iterator1 = bufferedImages.iterator(); iterator1.hasNext(); i++) {
BufferedImage bufferedImage = iterator1.next();
LOGGER.info("Image type {}", bufferedImage.getType());
File outFile = new File("C:\\tmp" + File.separator + "shane" + i + ".tiff");
Imaging.writeImage(bufferedImage, outFile, ImageFormats.TIFF, params);
}
Actually, testing performance, Apache Commons Imaging is a lot slower...
Or use an old version of iText, which is a lot faster:
private ByteArrayOutputStream convertTiffToPdf(InputStream imageStream) throws IOException, DocumentException {
Image image;
ByteArrayOutputStream out = new ByteArrayOutputStream();
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, out);
writer.setStrictImageSequence(true);
document.open();
RandomAccessFileOrArray ra = new RandomAccessFileOrArray(imageStream);
int pages = TiffImage.getNumberOfPages(ra);
for (int i = 1; i <= pages; i++) {
image = TiffImage.getTiffImage(ra, i);
image.setAbsolutePosition(0, 0);
image.scaleToFit(PageSize.A4.getWidth(), PageSize.A4.getHeight());
document.setPageSize(PageSize.A4);
document.newPage();
document.add(image);
}
document.close();
out.flush();
return out;
}
This is how I did it with ImageIO:
public List<BufferedImage> extractImages(InputStream fileInput) throws Exception {
List<BufferedImage> extractedImages = new ArrayList<BufferedImage>();
try (ImageInputStream iis = ImageIO.createImageInputStream(fileInput)) {
ImageReader reader = getTiffImageReader();
reader.setInput(iis);
int pages = reader.getNumImages(true);
for (int imageIndex = 0; imageIndex < pages; imageIndex++) {
BufferedImage bufferedImage = reader.read(imageIndex);
extractedImages.add(bufferedImage);
}
}
return extractedImages;
}
private ImageReader getTiffImageReader() {
Iterator<ImageReader> imageReaders = ImageIO.getImageReadersByFormatName("TIFF");
if (!imageReaders.hasNext()) {
throw new UnsupportedOperationException("No TIFF Reader found!");
}
return imageReaders.next();
}
I took part of the code from this blog.
All the proposed solutions require reading the multipage image page by page and writing the pages back to new TIFF images. Unless you want to save the individual pages in a different image format, there is no point in decoding the image. Given the special structure of a TIFF image, you can split a multipage TIFF into single TIFF images without decoding.
The TIFF tweaking tool I am using (part of a larger image-related library, "icafe") is written from scratch in pure Java. It can delete pages, insert pages, retain certain pages, split pages from a multipage TIFF, and merge multipage TIFF images into one TIFF image without decompressing them.
After trying with the TIFF tweaking tool, I am able to split the image into 3 pages: page#0, page#1, and page#2
NOTE1: The original demo image for some reason contains an "incorrect" StripByteCounts value of 1, which is not the actual number of bytes needed for the image strip. It turns out that the image data are not compressed, so the actual bytes for each image strip can be figured out from other TIFF field values such as RowsPerStrip, SamplesPerPixel, ImageWidth, etc.
NOTE2: When splitting the TIFF, the above-mentioned library doesn't need to decode and re-encode the image, so it's fast and it also keeps the original encoding and additional metadata of each page!
It works if you set the compression to Deflate with param.setCompression(32946); (32946 is the value of TIFFEncodeParam.COMPRESSION_DEFLATE).
public static void doitJAI(String mutitiff) throws IOException {
FileSeekableStream ss = new FileSeekableStream(mutitiff);
ImageDecoder dec = ImageCodec.createImageDecoder("tiff", ss, null);
int count = dec.getNumPages();
TIFFEncodeParam param = new TIFFEncodeParam();
param.setCompression(32946);
param.setLittleEndian(false); // false = big-endian (Motorola) byte order; true would be Intel
System.out.println("This TIF has " + count + " image(s)");
for (int i = 0; i < count; i++) {
RenderedImage page = dec.decodeAsRenderedImage(i);
File f = new File("D:/PSN/SCB/SCAN/bin/Debug/Temps/test/single_" + i + ".tif");
System.out.println("Saving " + f.getCanonicalPath());
ParameterBlock pb = new ParameterBlock();
pb.addSource(page);
pb.add(f.toString());
pb.add("tiff");
pb.add(param);
RenderedOp r = JAI.create("filestore",pb);
r.dispose();
}
}
The code below splits multipage TIFFs into individual images and produces an Excel sheet listing the TIFF images.
You need to create a folder on the C drive (C:/FAX/ in the code), place your TIFF images in it, and then run this code.
You need to import the jars below:
1. sun-as-jsr88-dm-4.0-sources
2. sun-jai_codec
3. sun-jai_core
import java.awt.AWTException;
import java.awt.Robot;
import java.awt.image.RenderedImage;
import java.awt.image.renderable.ParameterBlock;
import java.io.File;
import java.io.IOException;
import javax.media.jai.JAI;
import javax.media.jai.RenderedOp;
import com.sun.media.jai.codec.FileSeekableStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageDecoder;
import com.sun.media.jai.codec.TIFFEncodeParam;
import java.io.*;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import javax.swing.JOptionPane;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
public class TIFF_Sepreator {
File folder = new File("C:/FAX/");
public static void infoBox(String infoMessage, String titleBar)
{
JOptionPane.showMessageDialog(null, infoMessage, "InfoBox: " + titleBar, JOptionPane.INFORMATION_MESSAGE);
}
public void splitting() throws IOException, AWTException
{
boolean FinalFAXFolder = (new File("C:/Final_FAX")).mkdirs();
File[] listOfFiles = folder.listFiles();
String dateFormat = new SimpleDateFormat("yyyyMMdd_HHmmss").format(Calendar.getInstance().getTime());
try{
if (listOfFiles.length > 0)
{
for(int file=0; file<listOfFiles.length; file++)
{
System.out.println(listOfFiles[file]);
FileSeekableStream ss = new FileSeekableStream(listOfFiles[file]);
ImageDecoder dec = ImageCodec.createImageDecoder("tiff", ss, null);
int count = dec.getNumPages();
TIFFEncodeParam param = new TIFFEncodeParam();
param.setCompression(TIFFEncodeParam.COMPRESSION_GROUP4);
param.setLittleEndian(false); // false = big-endian (Motorola) byte order; true would be Intel
System.out.println("This TIF has " + count + " image(s)");
for (int i = 0; i < count; i++)
{
RenderedImage page = dec.decodeAsRenderedImage(i);
File f = new File("C:\\Final_FAX\\"+dateFormat+ file +i + ".tif");
System.out.println("Saving " + f.getCanonicalPath());
ParameterBlock pb = new ParameterBlock();
pb.addSource(page);
pb.add(f.toString());
pb.add("tiff");
pb.add(param);
RenderedOp r = JAI.create("filestore",pb);
r.dispose();
}
}
TIFF_Sepreator.infoBox("Find your splitted TIFF images in location 'C:/Final_FAX/' " , "Done :)");
WriteListOFFilesIntoExcel();
}
else
{
TIFF_Sepreator.infoBox("No files was found in location 'C:/FAX/' " , "Empty folder");
System.out.println("No files found");
}
}
catch(Exception e)
{
TIFF_Sepreator.infoBox("Unabe to run due to this error: " +e , "Error");
System.out.println("Error: "+e);
}
}
public void WriteListOFFilesIntoExcel(){
File[] listOfFiles = folder.listFiles();
ArrayList<File> files = new ArrayList<File>(Arrays.asList(folder.listFiles()));
try {
String filename = "C:/Final_FAX/List_Of_Fax_Files.xls" ;
HSSFWorkbook workbook = new HSSFWorkbook();
HSSFSheet sheet = workbook.createSheet("FirstSheet");
for (int file=0; file<listOfFiles.length; file++) {
System.out.println(listOfFiles[file]);
Row r = sheet.createRow(file);
r.createCell(0).setCellValue(files.get(file).toString());
}
FileOutputStream fileOut = new FileOutputStream(filename);
workbook.write(fileOut);
fileOut.close();
System.out.println("Your excel file has been generated!");
}
catch(Exception ex){
TIFF_Sepreator.infoBox("Unabe to run due to this error: " +ex , "Error");
System.out.println("Error: "+ex);
}
}
public static void main(String[] args) throws IOException, AWTException {
new TIFF_Sepreator().splitting();
}
}

Java ImageIO-ext TIF File Corrupt when Read

I am attempting to display a .tif in Java using a minimal number of additional libraries:
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.swing.ImageIcon;
import javax.swing.JFrame;
import javax.swing.JLabel;
import javax.swing.WindowConstants;
import javax.media.jai.widget.*;
import it.geosolutions.imageio.utilities.*;
import it.geosolutions.imageioimpl.plugins.tiff.*;
import com.sun.media.imageioimpl.common.*;
// The class name is illustrative; the original snippet showed only main
public class TifViewer {
public static void main(String[] args) {
try {
File f = new File("image.tif");
BufferedImage tif = ImageIO.read(f);
ImageIcon ic = new ImageIcon(tif);
JFrame frame = new JFrame();
frame.setDefaultCloseOperation(WindowConstants.EXIT_ON_CLOSE);
JLabel label = new JLabel(ic);
frame.add(label);
frame.setVisible(true);
} catch (IOException e) {
e.printStackTrace();
}
}
}
The libraries I'm using are:
jai-core-1.1.3.jar
jai-imageio-1.1.jar
imageio-ext-tiff.1.1.3.jar
imageio-ext-utilities.1.1.3.jar
From here: http://java.net/projects/imageio-ext (Downloads link on right side)
However, the displayed image is decidedly not the original image. No errors are being thrown that I know of. Furthermore, the original image file is fine and doesn't change.
However, the original code is small. I don't actually use the imageio-ext imports, but the program will fail without them. I also haven't used imageio-ext before either.
Please help! I need to be able to use .tif images in Java without installing software.
If you already use all JAI/ImageIO libraries, you might want to try the following (which works fine for me):
import java.awt.image.RenderedImage;
import java.io.IOException;
import com.sun.media.jai.codec.FileSeekableStream;
import com.sun.media.jai.codec.ImageCodec;
import com.sun.media.jai.codec.ImageDecoder;
// This function is minimal; you should add proper error handling
public RenderedImage read(String filename) throws IOException {
FileSeekableStream fss = new FileSeekableStream(filename);
ImageDecoder decoder = ImageCodec.createImageDecoder("tiff", fss, null);
RenderedImage image = decoder.decodeAsRenderedImage();
fss.close();
return image;
}
If you need a BufferedImage instead of a RenderedImage, the only solution I found is to use this function:
public static BufferedImage Rendered2Buffered(RenderedImage image) {
BufferedImage bi = new BufferedImage(image.getWidth(), image.getHeight(), image.getSampleModel().getDataType());
bi.setData(image.getData());
return bi;
}
Be careful though: image.getSampleModel().getDataType() returns a DataBuffer type constant, not a BufferedImage type. The usual value (DataBuffer.TYPE_BYTE, which is 0) is interpreted by the constructor as BufferedImage.TYPE_CUSTOM, which makes it impossible for the BufferedImage to be created! In my case I had to "guess" the type according to the sample size returned by image.getSampleModel().getSampleSize(0) (because I know the image format I'm working with).
If you know a better way to transform a RenderedImage to a BufferedImage, please enlighten me :)
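One commonly used alternative avoids guessing the type by building the BufferedImage from the image's own ColorModel and a compatible raster. The sketch below assumes the RenderedImage has a non-null ColorModel and its origin at (0, 0); the class name RenderedImages is made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.RenderedImage;
import java.awt.image.WritableRaster;

public class RenderedImages {
    /**
     * Copies a RenderedImage into a new BufferedImage that uses the same
     * ColorModel. Assumes the image origin (minX, minY) is (0, 0).
     */
    public static BufferedImage toBufferedImage(RenderedImage image) {
        ColorModel cm = image.getColorModel();
        if (cm == null) {
            throw new IllegalArgumentException("Image has no ColorModel");
        }
        // A raster laid out the way this ColorModel expects, so no image type constant is needed.
        WritableRaster raster = cm.createCompatibleWritableRaster(image.getWidth(), image.getHeight());
        BufferedImage result = new BufferedImage(cm, raster, cm.isAlphaPremultiplied(), null);
        // Copies the pixel data into the raster that backs the new image.
        image.copyData(raster);
        return result;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(4, 3, BufferedImage.TYPE_INT_ARGB);
        src.setRGB(1, 2, 0xFF112233);
        BufferedImage copy = toBufferedImage(src);
        System.out.println(Integer.toHexString(copy.getRGB(1, 2))); // prints ff112233
    }
}
```

If the input may already be a BufferedImage and a copy is not needed, an instanceof check before the conversion avoids the extra allocation.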
You're correct in thinking that you need the JAI libraries to decode and use TIFF files, but even though you've imported them, you aren't actually using them!
Here is a short tutorial showing you how to create a TIFFDecodeParam object (from the JAI library), and then use that to decode (and display) a TIFF image.
You might find the JAI API Library useful too.
I ended up going with the most recent version of Apache Commons Imaging (formerly Sanselan). Imaging offers out-of-the-box support for TIFF files (I had a little bit of trouble at first, but that was solved by switching from the older Sanselan to the newer Commons Imaging).
There was a little bit of functionality I had to reverse-engineer myself (loading a single sub-TIFF at a specified width while maintaining aspect ratio):
/**
* Load a scaled sub-TIFF image. Loads nth sub-image and scales to given width; preserves aspect ratio.
*
* @param fileName String filename
* @param index Index of sub-TIFF; will throw ArrayIndexOutOfBoundsException if sub-image doesn't exist
* @param w Desired width of image; height will scale
* @return Image (BufferedImage)
* @throws IOException
* @throws ImageReadException
*/
public static Image loadScaledSubTIFF(String fileName, int index, int w) throws IOException, ImageReadException {
File imageFile = new File(fileName);
ByteSourceFile bsf = new ByteSourceFile(imageFile);
FormatCompliance formatCompliance = FormatCompliance.getDefault();
TiffReader tiffReader = new TiffReader(true);
TiffContents contents = tiffReader.readDirectories(bsf, true, formatCompliance);
TiffDirectory td = contents.directories.get(index);
Image bi = td.getTiffImage(tiffReader.getByteOrder(), null);
Object width = td.getFieldValue(new TagInfo("", 256, TiffFieldTypeConstants.FIELD_TYPE_SHORT) {/**/});
Object height = td.getFieldValue(new TagInfo("", 257, TiffFieldTypeConstants.FIELD_TYPE_SHORT) {/**/});
int newWidth = w;
int newHeight = (int) ((newWidth * ((Number)height).doubleValue()) / (((Number)width).doubleValue()));
bi = bi.getScaledInstance(w, newHeight, java.awt.Image.SCALE_FAST);
height = null;
width = null;
td = null;
contents = null;
tiffReader = null;
formatCompliance = null;
bsf = null;
return bi;
}

how to insert text into a scanned pdf document using java

I have to add text to PDF documents, many of which are scanned. The inserted text ends up behind the scanned image instead of on top of it. How do I add text over the scanned image inside the PDF?
package editExistingPDF;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;
import jxl.read.biff.BiffException;
import org.apache.commons.io.FilenameUtils;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.PageSize;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfWriter;
public class AddPragraphToPdf {
public static void main(String[] args) throws IOException, DocumentException, BiffException {
String tan = "no tan";
File inputWorkbook = new File("lars.xls");
Workbook w;
w = Workbook.getWorkbook(inputWorkbook);
// Get the first sheet
Sheet sheet = w.getSheet(0);
Cell[] tnas =sheet.getColumn(0);
File ArticleFolder = new File("C:\\Documents and Settings\\sathishkumarkk\\My Documents\\article");
File[] listOfArticles = ArticleFolder.listFiles();
for (int ArticleInList = 0; ArticleInList < listOfArticles.length; ArticleInList++)
{
Document document = new Document(PageSize.A4);
// System.out.println(listOfArticles[ArticleInList].toString());
PdfReader pdfArticle = new PdfReader(listOfArticles[ArticleInList].toString());
if(listOfArticles[ArticleInList].getName().contains(".si."))
{continue;}
int noPgs=pdfArticle.getNumberOfPages();
String ArticleNoWithOutExt = FilenameUtils.removeExtension(listOfArticles[ArticleInList].getName());
String TanNo=ArticleNoWithOutExt.substring(0,ArticleNoWithOutExt.indexOf('.'));
// Create output PDF
PdfWriter writer = PdfWriter.getInstance(document,new FileOutputStream("C:\\Documents and Settings\\sathishkumarkk\\My Documents\\toPrint\\"+ArticleNoWithOutExt+".pdf"));
document.open();
PdfContentByte cb = writer.getDirectContent();
//get tan form excel sheet
System.out.println(TanNo);
for(Cell content : tnas){
if(content.getContents().contains(TanNo)){
tan=content.getContents();
System.out.println(tan);
}else{
continue;
}
}
// Load existing PDF
//PdfReader reader = new PdfReader(new FileInputStream("1.pdf"));
for (int i = 1; i <= noPgs; i++) {
PdfImportedPage page = writer.getImportedPage(pdfArticle, i);
// Copy first page of existing PDF into output PDF
document.newPage();
cb.addTemplate(page, 0, 0);
// Add your TAN here
Paragraph p= new Paragraph(tan);
Font font = new Font();
font.setSize(1.0f);
p.setLeading(12.0f, 1.0f);
p.setFont(font);
document.add(p);
}
document.close();
}
}
}
NOTE: When a PDF contains only text I have no problem, but when a PDF consists of scanned pages and I try to add text, the text is added behind the scanned image. So when I print those PDFs, the text I added does not appear.
From this iText example (which is the reverse of what you want, but switch getUnderContent with getOverContent and you'll be fine):
Each PDF page has two extra layers; one that sits on top of all text / graphics and one that goes to the bottom. All user added content gets in-between these two. If we get into this bottommost content, we can write anything under that we want. To get into this bottommost layer, we can use the "getUnderContent" method of the PdfStamper object.
This is documented in iText API Reference as shown below:
public PdfContentByte getUnderContent(int pageNum)
Gets a PdfContentByte to write under the page of the original document.
Parameters:
pageNum - the page number where the extra content is written
Returns:
a PdfContentByte to write under the page of the original document
To do this, you will need to first read in the PDF document, extract the elements and then add text to the document and resave it as a PDF document. This of course assumes that you can read the PDF document in the first place.
I'd recommend iText (see Example Code iText) to help you do this.
