Extract TIFF images from PDF without decoding

With the help of iText 5 I would like to extract all TIFF images from a given PDF file and save them as TIFF files.
Examples and other posts (1, 2) use the following method:
1. Create a PdfImageObject from the PDF stream, which in line 189 decodes the image stream (if a corresponding filter implementation is present).
2. Call PdfImageObject#getImageAsBytes(), which returns JPEG (original), PNG (re-encoded) or TIFF (in case of 8 bits per pixel).
As a result, a TIFF image with 1-bit color depth is converted to PNG, which is not what I need.
Another approach would be to call PdfImageObject#getBufferedImage(), which decodes the image in step (2) into a raster that is then encoded again as TIFF using ImageIO.write(bufferedImage, "tiff", file).
As one can see, this is not efficient. Another solution, shown in this post, demonstrates how to save the encoded TIFF image stream to a file by prepending a TIFF header to it; that is the solution I am looking for.
Can iText help here?

PDF images are not TIFF images.
PDFs however can contain images that use compression techniques that are also used in TIFF, e.g. Flate, CCITT, LZW, JPEG.
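To make the "prepend a TIFF header" idea from the question concrete, here is a minimal sketch in plain Java with no iText calls. It assumes you already have the still-encoded stream bytes plus the width and height from the image dictionary, and that the stream is CCITT Group 4 (a CCITTFaxDecode filter with negative K). The class name CcittTiffWrapper and the whiteIsZero parameter are hypothetical; which photometric value matches the PDF's BlackIs1 setting is something you would need to verify against your own files.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: wraps raw CCITT Group 4 data in a minimal single-strip TIFF. */
final class CcittTiffWrapper {

    private static final int ENTRY_COUNT = 9;
    // header (8) + entry count (2) + entries (9 * 12) + next-IFD offset (4)
    private static final int DATA_OFFSET = 8 + 2 + ENTRY_COUNT * 12 + 4;

    static byte[] wrap(byte[] ccittData, int width, int height, boolean whiteIsZero) {
        ByteBuffer buf = ByteBuffer.allocate(DATA_OFFSET + ccittData.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 'I').put((byte) 'I'); // "II" = little-endian byte order
        buf.putShort((short) 42);            // TIFF magic number
        buf.putInt(8);                       // offset of the first (and only) IFD

        buf.putShort((short) ENTRY_COUNT);
        // IFD entries must be sorted by ascending tag number.
        putEntry(buf, 256, 4, width);               // ImageWidth
        putEntry(buf, 257, 4, height);              // ImageLength
        putEntry(buf, 258, 3, 1);                   // BitsPerSample
        putEntry(buf, 259, 3, 4);                   // Compression = CCITT T.6 (Group 4)
        putEntry(buf, 262, 3, whiteIsZero ? 0 : 1); // PhotometricInterpretation (assumption)
        putEntry(buf, 273, 4, DATA_OFFSET);         // StripOffsets
        putEntry(buf, 277, 3, 1);                   // SamplesPerPixel
        putEntry(buf, 278, 4, height);              // RowsPerStrip: whole image in one strip
        putEntry(buf, 279, 4, ccittData.length);    // StripByteCounts
        buf.putInt(0);                              // no further IFD

        buf.put(ccittData);                         // encoded data copied unchanged
        return buf.array();
    }

    /** Writes one 12-byte IFD entry holding a single value (type 3 = SHORT, 4 = LONG). */
    private static void putEntry(ByteBuffer buf, int tag, int type, int value) {
        buf.putShort((short) tag);
        buf.putShort((short) type);
        buf.putInt(1); // value count
        if (type == 3) {
            buf.putShort((short) value);
            buf.putShort((short) 0); // pad the 4-byte value field
        } else {
            buf.putInt(value);
        }
    }
}
```

The encoded bytes are copied as they are, so nothing is decoded or re-encoded. Other filters (Flate, LZW, JPEG) would need different tag values and possibly extra tags, which this sketch does not cover.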

Related

Android : How to iterate through webp frames using android's ImageDecoder

I want to convert an animated WebP into a GIF. I have a GIF encoder and decoder and a WebP encoder, and they work fine, but only with GIFs. I want to process animated WebP as well, so I first need to decode the animated WebP and get a bitmap for each frame. I could not find an animated WebP decoder; later I found that android.graphics has ImageDecoder, which supports animated WebP images, but the example shown is for a Drawable, which has a start() method for animated WebP.
How can I iterate through the frames and convert each one into a bitmap, or into some data type like byte[], Base64 or a stream, so that I can convert that into a bitmap?
File file = new File(...);
ImageDecoder.Source source = ImageDecoder.createSource(file);
Drawable drawable = ImageDecoder.decodeDrawable(source);

As an alternative for achieving the same goal, I solved this by using Glide and the APNG4 library along with some encoders and decoders available on git.
You can do both encoding and decoding, and other things, with APNG4 alone.
https://github.com/penfeizhou/APNG4Android

Here is how we can extract frames from an animated WebP file without using any third-party library.
According to Google's Container Specification for the WebP image format, we need to read the image in a specific way, and you can do that in almost any language you like.
In Java you can create an InputStream of the animated WebP file and read the data sequentially, 4 bytes at a time.
There is a library, android-webp-encoder, written in pure Java for encoding WebP images.
You can use it for decoding as well, but the library needs to be modified. I have modified it but not published it yet. I will upload it to GitHub once I have fixed the bugs.
But I can explain how to modify that library to decode frames, or how to write your own code to decode them.
First create an InputStream of the image.
Read the data in 4-byte chunks until the end of the file.
Reading:
- Read 4 bytes and check that they are the characters 'RIFF'.
- Then read the next 4 bytes. This is the file size (not counting the 8 bytes read so far).
- After the file size, the next 4 bytes must be the characters 'WEBP'.
- The next 4 bytes will be the characters 'VP8X'. The actual image data and parameters start from here.
- The next 4 bytes are the chunk size and must contain the value 10, because after that we need to read 10 bytes in the specific manner stated in Google's container specification.
- After VP8X, ANIM and other optional chunks, we have to read each ANMF chunk, which holds optional ALPH data followed by VP8/VP8L data. These are the actual image data we need to extract and create bitmaps from.
Each ANMF occurrence signals one frame.
You can write each frame as static WebP image data to a ByteArrayOutputStream and create a bitmap from its bytes using BitmapFactory.decodeByteArray(bytes, 0, bytes.length). This returns a bitmap image of that frame.
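The reading steps above can be sketched roughly as follows. This is not the modified android-webp-encoder code, only a small stand-alone illustration of walking the RIFF chunks; the class name WebpChunkWalker and its Chunk type are hypothetical. It assumes the whole file has already been read into a byte[] and only locates the top-level chunks; turning an ANMF payload into a static WebP image is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the top-level chunks of a WebP (RIFF) file. */
final class WebpChunkWalker {

    /** One chunk: its FourCC, the offset of its payload in the file, and the payload size. */
    static final class Chunk {
        final String fourCc;
        final int payloadOffset;
        final int payloadSize;

        Chunk(String fourCc, int payloadOffset, int payloadSize) {
            this.fourCc = fourCc;
            this.payloadOffset = payloadOffset;
            this.payloadSize = payloadSize;
        }
    }

    static List<Chunk> listChunks(byte[] file) {
        // All integers in a RIFF container are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        if (file.length < 12 || !"RIFF".equals(readFourCc(buf))) {
            throw new IllegalArgumentException("not a RIFF file");
        }
        buf.getInt(); // file size minus 8, not needed here
        if (!"WEBP".equals(readFourCc(buf))) {
            throw new IllegalArgumentException("not a WebP file");
        }
        List<Chunk> chunks = new ArrayList<>();
        while (buf.remaining() >= 8) {
            String id = readFourCc(buf);
            int size = buf.getInt();
            if (size < 0 || size > buf.remaining()) {
                throw new IllegalArgumentException("truncated chunk " + id);
            }
            chunks.add(new Chunk(id, buf.position(), size));
            // Chunk payloads are padded to an even number of bytes.
            int padded = size + (size & 1);
            buf.position(Math.min(buf.position() + padded, buf.limit()));
        }
        return chunks;
    }

    private static String readFourCc(ByteBuffer buf) {
        byte[] id = new byte[4];
        buf.get(id);
        return new String(id, StandardCharsets.US_ASCII);
    }
}
```

Counting the chunks whose fourCc is "ANMF" gives the number of frames. The payload offsets tell you where each frame's parameters and image data start.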

Lossless image extraction from PDF

I'm using PDFBox to extract images out of a PDF file and feed it to another image processing library (that can handle different image formats). My current code is like this:
PDImageXObject pdImage;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
BufferedImage image = pdImage.getImage();
ImageIO.write(image, "png", baos);
byte[] imageBytes = baos.toByteArray();
This will take whatever is stored in the PDF file and use Java graphics to convert it to PNG. Is there a better way to avoid conversion and extract the image in whatever format it is embedded? I don't want to degrade image quality (I suppose mitigated by using a lossless format like PNG?) and incur conversion overhead.

The DEFLATE algorithm is used by the FlateDecode filter and by the PNG file format. However a stream of FlateDecode-compressed data isn't itself a PNG file.
Also, you need to consider the colorspace representation of the Image XObject (e.g. DeviceCMYK) versus what PNG actually supports.
By targeting lossless compression for your output image file you won't lose any information. (Be sure you actually need a lossless extracted image; people often assume that lossy compression will change their image so much that it is no longer recognizable. In many cases, depending on the parameters, the loss is hardly noticeable to the naked eye, and you can benefit substantially from the size savings of lossy compression.)
If performance is slow, the cause could simply be the quality of the PDF software responsible for extracting and saving the image.
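To illustrate the first point of this answer (Flate-compressed data is not itself a PNG file), here is a small sketch using only java.util.zip. The class name FlateVsPng is made up for the example. It compresses some bytes with DEFLATE and checks whether the result starts with the 8-byte PNG signature.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/** Hypothetical demo: a DEFLATE stream lacks the PNG signature and chunk structure. */
final class FlateVsPng {

    static final byte[] PNG_SIGNATURE = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    /** Compresses raw bytes into a zlib-wrapped DEFLATE stream. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static boolean startsWithPngSignature(byte[] data) {
        if (data.length < PNG_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < PNG_SIGNATURE.length; i++) {
            if (data[i] != PNG_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling startsWithPngSignature(deflate(pixels)) returns false. A PNG file additionally needs the signature, an IHDR chunk describing the image, and the compressed data wrapped in IDAT chunks, so the raw stream cannot simply be saved with a .png extension.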

Working with JPEG images in Java

I am using the BufferedImage class to read in an image as pixels which I then use bit shifting to get their appropriate components into separate int arrays. This works OK.
I have used this reference site to manually perform DCT functions with the pixel arrays.
Methods used: forwardDCT(), quantitizeMatrix(), dequantitzeMatrax(), inverseDCT()
Their results are then fed back into a resultant image array to reconstruct the JPEG file, and I then use the ImageIO.write() method (with a BufferedImage) to write the pixel data back out as the image.
This works perfectly, and I can view the image. (Even better, the value I use to compress visually works.)
My question is: is there a way to write the quantized coefficients as the compressed values in a JPEG, given that the write() method takes pixel data as input rather than coefficient data?
Hope this is clear.
Thanks

Ultimately the DCT calculation is just one step in the whole JPEG encoding process. A complete implementation also has to deal with quantization, Huffman encoding, and conforming with the JPEG standard.
Java effectively just gives you an interface to a JPEG encoder that lets you do useful things like save images.
The ImageWriter that ImageIO.write() uses for JPEG images depends on your system. The default ImageWriter for JPEGs will only let you change some settings that affect the quantization and encoding using the JPEGImageWriteParam class (http://docs.oracle.com/javase/6/docs/api/javax/imageio/ImageWriteParam.html).
Getting your hand-crafted DCT coefficients into a JPEG file could potentially involve writing an entire JPEG library. If you don't want to do all that work, then you could modify the source of an existing library so that it uses your DCT coefficients.
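For reference, the settings mentioned above can be reached through the standard javax.imageio API roughly like this. It is a minimal sketch: the class name JpegQualityWriter is made up, and it only changes the compression quality, which is about as close to the quantization step as the default writer lets you get.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.MemoryCacheImageOutputStream;

/** Hypothetical helper: encodes an image as JPEG with an explicit quality setting. */
final class JpegQualityWriter {

    /** quality ranges from 0.0 (strongest compression) to 1.0 (best quality). */
    static byte[] encode(BufferedImage image, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (MemoryCacheImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            // The writer still takes pixels; it performs DCT, quantization and entropy coding itself.
            writer.write(null, new IIOImage(image, null, null), param);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

Note that the input is still a BufferedImage of pixels. Nothing in this API accepts pre-computed DCT coefficients, which is why the answer points to writing or modifying an encoder.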

Before the DCT . . .
While JPEG has no knowledge of colors, it is normal for JPEG file formats to use the YCbCr color space. If you are thinking about writing a JPEG file, you would need to do this conversion first.
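As a rough sketch of that conversion, assuming 8-bit RGB input and the full-range formulas used by JFIF (the class and method names are made up for the example):

```java
/** Hypothetical helper: converts 8-bit RGB to full-range YCbCr as used by JFIF. */
final class YCbCrConverter {

    /** Returns {Y, Cb, Cr}, each clamped to 0..255. */
    static int[] rgbToYCbCr(int r, int g, int b) {
        double y  =       0.299    * r + 0.587    * g + 0.114    * b;
        double cb = 128 - 0.168736 * r - 0.331264 * g + 0.5      * b;
        double cr = 128 + 0.5      * r - 0.418688 * g - 0.081312 * b;
        return new int[] {clamp(y), clamp(cb), clamp(cr)};
    }

    private static int clamp(double v) {
        return (int) Math.max(0, Math.min(255, Math.round(v)));
    }
}
```

Each of the three resulting planes is then split into 8x8 blocks and passed through the forward DCT separately, usually after subtracting 128 from every sample.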
After the Quantization . . .
After zigzag ordering, the coefficients are run-length encoded (and then Huffman coded). That's a step you'd have to add, and it is the most complex part of JPEG encoding.
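To give an idea of the run-length step, here is a simplified sketch. It assumes the 64 quantized coefficients of one block are already in zigzag order with the DC value at index 0, and it produces (zero run, value) pairs for the AC coefficients. A real encoder would go on to split each value into a size category and amplitude bits and Huffman-code the result, which is not shown. The class name JpegRunLength is made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: run-length encodes the AC coefficients of one 8x8 block. */
final class JpegRunLength {

    /**
     * @param zigzag 64 quantized coefficients in zigzag order (index 0 is the DC value)
     * @return pairs {zeroRun, value}; {15, 0} stands for 16 zeros, {0, 0} marks end of block
     */
    static List<int[]> encodeAc(int[] zigzag) {
        if (zigzag.length != 64) {
            throw new IllegalArgumentException("expected 64 coefficients");
        }
        // Find the last non-zero AC coefficient; everything after it collapses into end-of-block.
        int last = 63;
        while (last > 0 && zigzag[last] == 0) {
            last--;
        }
        List<int[]> pairs = new ArrayList<>();
        int run = 0;
        for (int i = 1; i <= last; i++) {
            if (zigzag[i] == 0) {
                run++;
                if (run == 16) {            // a run field only holds 0..15
                    pairs.add(new int[] {15, 0});
                    run = 0;
                }
            } else {
                pairs.add(new int[] {run, zigzag[i]});
                run = 0;
            }
        }
        if (last < 63) {
            pairs.add(new int[] {0, 0});    // end of block
        }
        return pairs;
    }
}
```

The DC coefficient is handled separately in JPEG (it is coded as a difference from the previous block's DC value), which is why the loop starts at index 1.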

Best way to convert between Image Types in embedded Java?

I'd like to convert (back and forth) the following
- PS to TIFF
- TIFF to PDF
- PDF to TIFF
- GIF to TIFF
- JPEG to TIFF
- TIFF (LZW) to TIFF (CCITT G4)
Where, if not specified, TIFF uses CCITT G4 encoding.
Within embedded code of a Java app; any recommended solutions?

Java supports many formats out of the box, and writing the code to do the conversions is simple and straightforward. PDF is not supported as standard, however, but there are plenty of libraries out there that will decode it, for instance PDFBox.
You can use ImageIO to read and write many image formats. For example, here is how you might convert a JPEG to a BMP.
// Read the JPEG
File input = new File("c:/image.jpg");
BufferedImage image = ImageIO.read(input);
// Write the Bitmap
File output = new File("c:/image.bmp");
ImageIO.write(image, "bmp", output);
For ImageIO (more specifically, an ImageReader / ImageWriter) to recognize a particular image format, an ImageReaderSpi and an ImageWriterSpi for it must be registered with the IIORegistry. So, if you want to use ImageIO to read / write unsupported formats such as PDF, you must write your own implementation or download a library that provides them. Writing them is pretty easy; I have done so in the past.
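Before attempting a conversion, you can ask ImageIO which formats currently have a registered reader or writer. The sketch below uses only standard javax.imageio calls; the class name ImageIoFormats and its methods are made up for the example. Which names show up depends on the Java version and on any plugins on the classpath.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.imageio.ImageIO;

/** Hypothetical helper: reports which formats ImageIO can currently read and write. */
final class ImageIoFormats {

    static Set<String> readable() {
        return normalize(ImageIO.getReaderFormatNames());
    }

    static Set<String> writable() {
        return normalize(ImageIO.getWriterFormatNames());
    }

    /** True if a reader for 'from' and a writer for 'to' are both registered. */
    static boolean canConvert(String from, String to) {
        return readable().contains(from.toLowerCase(Locale.ROOT))
            && writable().contains(to.toLowerCase(Locale.ROOT));
    }

    // Providers register names in mixed case ("JPG", "jpg"); fold them into one sorted set.
    private static Set<String> normalize(String[] names) {
        Set<String> result = new TreeSet<>();
        for (String name : names) {
            result.add(name.toLowerCase(Locale.ROOT));
        }
        return result;
    }
}
```

If canConvert("tiff", "png") returns false on your runtime, that is the sign that a plugin or a separate library is needed for that format.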

Multipart tiff and EXIF metadata

In the TIFF format, when you add EXIF metadata it creates a new IFD (TIFF directory) and stores the EXIF metadata as fields. So parsing a TIFF file with a single image and EXIF data is easy. But you can get multipart TIFFs, where one file contains more than one image. The question is: can each of these images have EXIF data?
Does this create a new IFD for each picture's metadata?
What is the arrangement of the IFDs then?
The TIFF specification doesn't go into any detail. I know that when a single-image TIFF file has EXIF data there is an offset field pointing to the EXIF data, so I can jump to that location and do the parsing myself. The Java Sanselan library gives me easy access to the EXIF IFD and its fields, but if it is possible to have multiple EXIF IFDs (one for each image), then the library doesn't tell me to which image the data belongs.
If you cannot have more than one EXIF IFD in a multipart TIFF file, then it'll be trivial! In other words:
Do I need to go to the effort of manually parsing the exif data? Because I only need to do this if you can attach EXIF data to each image inside a multipart tif.
Or does anyone know of a good Linux app that allows me to add EXIF data to tif files so I can figure it out for myself?

To answer your questions:
can each of these images have EXIF data? Does this create a new IFD for each picture's metadata? What is the arrangement of the IFDs then?
Yes, each of these images can have its own EXIF data. Each image is related to its own IFD, and each set of EXIF data is a sub-IFD inside the corresponding image IFD.
but the Java Sanselan library gives me easy access to the EXIF IFD and fields, but if it is possible to have multiple EXIF IFDs (one for each image) then the library doesn't tell me to which image the data belongs.
I have never used Sanselan or its successor Apache Imaging, so I can only guess that one of two things is happening here: first, Sanselan may by default choose the first page of a multipage TIFF (if it actually lets you insert EXIF into a multipage TIFF); or there might be a parameter you can set somewhere, with a method like setWorkingPage(int page), which is what I do with the "icafe" Java image library.
The following is a bit more detailed information as to what is happening inside a TIFF image when you need to add EXIF metadata:
For a single-page TIFF, there is a "main" IFD which specifies all the information regarding the image contained there. When EXIF data is needed, a special tag called "EXIF_SUB_IFD" is added to the main IFD. The value of this tag is an offset relative to the start of the image stream. If we jump to the address specified by the offset, we will find a "sub" IFD, with exactly the same structure as the "main" IFD, which contains all the EXIF data.
The structure described above is exactly like a directory tree, hence the name IFD. There is however a subtle difference: the main IFD should contain the actual image data but the EXIF sub-IFD doesn't. In fact, there is also a GPS sub-IFD, which sits in parallel with the EXIF sub-IFD and has the same structure. An interesting point is that the EXIF data can be stored anywhere inside the TIFF image stream (as long as it doesn't break other parts of the directory and image data).
Now to the multipage TIFF. The pages can be related or not. The last 4 bytes of each page IFD point to the offset of the next IFD. Pages are sometimes gathered together to serve as a "single" document, for example one coming from a scanner. That said, each page is itself a "single"-page TIFF which can contain its own EXIF metadata, just like a single-page TIFF.
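The layout described above can be walked with a few lines of plain Java. This is only a sketch under some assumptions: a classic (non-BigTIFF) file fully loaded into a byte[], well-formed offsets, and the standard EXIF IFD pointer tag 34665 (0x8769). The class name TiffExifLocator is made up, and this is not how Sanselan or icafe expose the data.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds the EXIF sub-IFD offset of every page in a TIFF file. */
final class TiffExifLocator {

    private static final int EXIF_IFD_TAG = 0x8769; // 34665, EXIF IFD pointer

    /** Returns one entry per page IFD, in chain order: the EXIF sub-IFD offset, or -1 if absent. */
    static List<Long> exifOffsetsPerPage(byte[] tiff) {
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        if (tiff[0] == 'I' && tiff[1] == 'I') {
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (tiff[0] == 'M' && tiff[1] == 'M') {
            buf.order(ByteOrder.BIG_ENDIAN);
        } else {
            throw new IllegalArgumentException("not a TIFF file");
        }
        if (buf.getShort(2) != 42) {
            throw new IllegalArgumentException("bad TIFF magic number");
        }

        List<Long> result = new ArrayList<>();
        long ifdOffset = buf.getInt(4) & 0xFFFFFFFFL;
        while (ifdOffset != 0) {
            if (result.size() > 65535) {
                throw new IllegalArgumentException("IFD chain too long, possibly cyclic");
            }
            int pos = (int) ifdOffset;
            int entryCount = buf.getShort(pos) & 0xFFFF;
            long exifOffset = -1;
            for (int i = 0; i < entryCount; i++) {
                int entry = pos + 2 + i * 12;          // each entry is 12 bytes
                int tag = buf.getShort(entry) & 0xFFFF;
                if (tag == EXIF_IFD_TAG) {
                    exifOffset = buf.getInt(entry + 8) & 0xFFFFFFFFL;
                }
            }
            result.add(exifOffset);
            // The 4 bytes after the last entry hold the offset of the next page IFD (0 = end).
            ifdOffset = buf.getInt(pos + 2 + entryCount * 12) & 0xFFFFFFFFL;
        }
        return result;
    }
}
```

The index in the returned list is the page number, which answers the "to which image does the data belong" part: each EXIF sub-IFD is reached from exactly one page IFD.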

You probably want to check out ExifTool. It works pretty well for what I use it on (JPEGs), but I've never used it with TIFF files containing multiple images. Also check ImageMagick; it has a ton of useful tools.
