Convert PDF to JPG2000 file(s)

I recently started working on a project where I need to convert a PDF file into JPEG 2000 files, one .jp2 file per page.
The goal is to replace a PDF-to-JPEG converter method we already had, in order to reduce the size of the output files.
Based on code I found on the internet, I wrote the PDF-to-JPEG 2000 converter method below, and I have been changing the setEncodingRate parameter value and comparing the results.
I managed to get smaller JPEG 2000 output files, but the quality is very poor compared to the JPEG ones, especially for colored text or images.
With setEncodingRate set to 0.8, the output file is 850 KB, which is even bigger than the JPEG ones (around 600 KB), and the quality is lower.
With setEncodingRate set to 0.1, the file is considerably smaller, 111 KB, but basically unreadable.
So what I'm trying to get is smaller output files (< 600 KB) with better quality, and I'm wondering whether that is feasible with the JPEG 2000 format.
public class ImageConverter {
public void compressor(String inputFile, String outputFile) throws IOException {
J2KImageWriteParam iwp = new J2KImageWriteParam();
PDDocument document = PDDocument.load(new File (inputFile), MemoryUsageSetting.setupMixed(10485760L));
PDFRenderer pdfRenderer = new PDFRenderer(document);
int nbPages = document.getNumberOfPages();
int pageCounter = 0;
BufferedImage image;
for (PDPage page : document.getPages()) {
if (page.hasContents()) {
image = pdfRenderer.renderImageWithDPI(pageCounter, 300, ImageType.RGB);
if (image == null)
{
System.out.println("If no registered ImageReader claims to be able to read the resulting stream");
}
Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("JPEG2000");
String name = null;
ImageWriter writer = null;
// Compare the class name with equals(); != only compares object references
while (!"com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageWriter".equals(name)) {
writer = writers.next();
name = writer.getClass().getName();
System.out.println(name);
}
File f = new File(outputFile+"_"+pageCounter+".jp2");
long s = System.currentTimeMillis();
ImageOutputStream ios = ImageIO.createImageOutputStream(f);
writer.setOutput(ios);
J2KImageWriteParam param = (J2KImageWriteParam) writer.getDefaultWriteParam();
IIOImage ioimage = new IIOImage(image, null, null);
param.setSOP(true);
param.setWriteCodeStreamOnly(true);
param.setProgressionType("layer");
param.setLossless(true);
param.setCompressionMode(J2KImageWriteParam.MODE_EXPLICIT);
param.setCompressionType("JPEG2000");
param.setCompressionQuality(0.01f);
param.setEncodingRate(1.01);
param.setFilter(J2KImageWriteParam.FILTER_53 );
writer.write(null, ioimage, param);
System.out.println(System.currentTimeMillis() - s);
writer.dispose();
ios.flush();
ios.close();
image.flush();
pageCounter++;
}
}
}
public static void main(String[] args) {
String input = "E:/IMGTEST/mail-DOC0002.pdf";
String output = "E:/IMGTEST/mail-DOC0002/docamail-DOC0002-";
ImageConverter imgcv = new ImageConverter();
try {
imgcv.compressor(input, output);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

Related

How to convert one pdf to multiple png images with multithreading

I used the following method to convert a PDF into multiple PNG images:
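import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.imgscalr.Scalr;
// "log" below is assumed to be an SLF4J-style logger field (for example from Lombok's @Slf4j); its declaration is not shown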
public class ImgUtil {
public static List<String> convertPDFPagesToImages(String sourceFilePath, String desFilePath){
List<String> urlList = new ArrayList<>();
try {
File sourceFile = new File(sourceFilePath);
File destinationFile = new File(desFilePath);
if (!destinationFile.exists()) {
destinationFile.mkdir();
log.info("Folder Created ->:{} ", destinationFile.getAbsolutePath());
}
if (sourceFile.exists()) {
log.info("Images copied to Folder Location: {}", destinationFile.getAbsolutePath());
PDDocument document = PDDocument.load(sourceFile);
PDFRenderer pdfRenderer = new PDFRenderer(document);
int numberOfPages = document.getNumberOfPages();
log.info("Total files to be converting ->{} ", numberOfPages);
String fileName = sourceFile.getName().replace(".pdf", "");
String fileExtension = "png";
/*
* 600 dpi gives good image clarity, but each image is roughly 3x the size of the 300 dpi one.
* Ex: 1. For 300dpi 04-Request-Headers_2.png expected size is 797 KB
* 2. For 600dpi 04-Request-Headers_2.png expected size is 2.42 MB
*/
int dpi = 300; // use a lower dpi to save disk space; for professional use you can go above 300 dpi
for (int i = 0; i < numberOfPages; ++i) {
File outPutFile = new File(desFilePath + fileName +"_"+ (i+1) +"."+ fileExtension);
BufferedImage bImage = pdfRenderer.renderImageWithDPI(i, dpi, ImageType.RGB);
ImageIO.write(bImage, fileExtension, outPutFile);
urlList.add(outPutFile.getPath().replaceAll("\\\\", "/"));
}
document.close();
log.info("Converted Images are saved at ->{} ", destinationFile.getAbsolutePath());
} else {
log.error(sourceFile.getName() +" File not exists");
}
} catch (Exception e) {
e.printStackTrace();
}
return urlList;
}
public static void main(String[] args) {
convertPDFPagesToImages("D:\\tmp\\report\\pdfPath\\61199020100754118.pdf", "D:\\tmp\\report\\pdfPath\\");
}
}
But I found that when the PDF has a relatively large number of pages, the image conversion is slow. I am considering using multithreading to render the images. Is it possible to convert a PDF into images using multiple threads, or is there a similar approach?

A simple way to speed up this conversion would be to split image writing to a background thread. Set up an executorService before opening the PDF:
ExecutorService exec = Executors.newFixedThreadPool(1);
List<Future<?>> pending = new ArrayList<>();
Instead of writing the image in the same calling thread, just submit a new task to the service:
// ImageIO.write(bImage, fileExtension, outPutFile);
pending.add(exec.submit(() -> write(bImage, fileExtension, outPutFile)));
And a function to perform the task:
private static void write(BufferedImage image, String fileExtension, File file) {
try {
ImageIO.write(image, fileExtension, file);
} catch (IOException e) {
throw new UncheckedIOException(e);
}
}
After closing the PDF document make sure the executor is finished:
for (Future<?> fut : pending) {
fut.get();
}
exec.shutdown();
exec.awaitTermination(365, TimeUnit.DAYS);
Using more than one thread for ImageIO.write may not benefit you, as it is a heavy I/O operation. Depending on your specific hardware, it may also help to experiment with writing to a large ByteArrayOutputStream first and then to the file.
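For reference, the pieces above can be put together into one self-contained sketch that uses only the JDK. The class name BackgroundImageWriter, the method writeAll and the page_N file naming are made up for this illustration. The images are passed in as a list only to keep the example runnable without PDFBox; in the real loop each page would be submitted right after renderImageWithDPI returns, so that rendering the next page overlaps with writing the previous one.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.imageio.ImageIO;

// Hypothetical helper: hands image encoding to a single background thread.
public class BackgroundImageWriter {

    // Encodes one image to disk; wraps IOException so it can run inside a lambda.
    private static void write(BufferedImage image, String format, File file) {
        try {
            if (!ImageIO.write(image, format, file)) {
                throw new IOException("No ImageIO writer found for format " + format);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Submits every image to the background thread, then waits for all writes to finish.
    public static List<File> writeAll(List<BufferedImage> pages, File dir, String format) {
        ExecutorService exec = Executors.newFixedThreadPool(1);
        List<Future<?>> pending = new ArrayList<>();
        List<File> outputs = new ArrayList<>();
        try {
            for (int i = 0; i < pages.size(); i++) {
                BufferedImage page = pages.get(i);
                File out = new File(dir, "page_" + (i + 1) + "." + format);
                outputs.add(out);
                pending.add(exec.submit(() -> write(page, format, out)));
            }
            for (Future<?> fut : pending) {
                fut.get(); // rethrows any failure from the background write
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for image writes", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Background image write failed", e.getCause());
        } finally {
            exec.shutdown();
        }
        return outputs;
    }
}
```

Checking each Future is what surfaces a failed write; without it, an exception thrown on the background thread would go unnoticed.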

How do you access an attachment stored as MIME Part?

It seems to me there are two ways to store an attachment in a NotesDocument: either as a RichTextField or as a "MIME Part".
If they are stored as RichText you can do stuff like:
document.getAttachment(fileName)
That does not seem to work for an attachment stored as a MIME Part.
I have thousands of documents like this in the backend. This is NOT a UI issue where I need to use the file Download control of XPages.
Each document has only one attachment: an image, a JPG file. I have three databases for different sizes: Original, Large, and Small. Originally I created everything from documents that had the attachment stored as RichText, but my code saved them as MIME Part. That's just what it did; it was not really my intent.
What happened is that I lost some of my "Small" pictures, so I need to rebuild them from the Original pictures, which are now stored as MIME Part. My ultimate goal is to get the image from the NotesDocument into a Java BufferedImage.
I think I have the code to do what I want, but I simply can't figure out how to get the attachment off the document and then into a Java BufferedImage.
Below is some rough code I'm working with. My goal is to pass in the document with the original picture. I already have the fileName because I stored it in metadata, but I don't know how to get it from the document itself. I'm passing in "Small" to create the Small image.
I think I just don't know how to work with attachments stored in this manner.
Any ideas/advice would be appreciated! Thanks!!!
public Document processImage(Document inputDoc, String fileName, String size) throws IOException {
// fileName is the name of the attachment on the document
// The goal is to return a NEW BLANK document with the image on it
// The Calling code can then deal with keys and meta data.
// size is "Original", "Large" or "Small"
System.out.println("Processing Image, Size = " + size);
//System.out.println("Filename = " + fileName);
boolean result = false;
Session session = Factory.getSession();
Database db = session.getCurrentDatabase();
session.setConvertMime(true);
BufferedImage img;
BufferedImage convertedImage = null; // the output image
EmbeddedObject image = null;
InputStream imageStream = null;
int currentSize = 0;
int newWidth = 0;
String currentName = "";
try {
// Get the Embedded Object
image = inputDoc.getAttachment(fileName);
System.out.println("Input Form : " + inputDoc.getItemValueString("form"));
if (null == image) {
System.out.println("ALERT - IMAGE IS NULL");
}
currentSize = image.getFileSize();
currentName = image.getName();
// Get a Stream of the Image
imageStream = image.getInputStream();
img = ImageIO.read(imageStream); // this is the buffered image we'll work with
imageStream.close();
Document newDoc = db.createDocument();
// Remember this is a BLANK document. The calling code needs to set the form
if ("original".equalsIgnoreCase(size)) {
this.attachImage(newDoc, img, fileName, "JPG");
return newDoc;
}
if ("Large".equalsIgnoreCase(size)) {
// Now we need to convert the LARGE image
// We're assuming FIXED HEIGHT of 600px
newWidth = this.getNewWidth(img.getHeight(), img.getWidth(), 600);
convertedImage = this.getScaledInstance(img, newWidth, 600, false);
this.attachImage(newDoc, convertedImage, fileName, "JPG"); // attach the scaled image, not the original
return newDoc;
}
if ("Small".equalsIgnoreCase(size)) {
System.out.println("converting Small");
newWidth = this.getNewWidth(img.getHeight(), img.getWidth(), 240);
convertedImage = this.getScaledInstance(img, newWidth, 240, false);
this.attachImage(newDoc, convertedImage, fileName, "JPG"); // attach the scaled image, not the original
System.out.println("End Converting Small");
return newDoc;
}
return newDoc;
} catch (Exception e) {
// HANDLE EXCEPTION HERE
// SAMPLE WRITE TO LOG.NSF
System.out.println("****************");
System.out.println("EXCEPTION IN processImage()");
System.out.println("****************");
System.out.println("picName: " + fileName);
e.printStackTrace();
return null;
} finally {
if (null != imageStream) {
imageStream.close();
}
if (null != image) {
LibraryUtils.incinerate(image);
}
}
}

I believe it will be some variation of the following code snippet. You might have to change which MIMEEntity you read the content from: it might be in the parent or in another child, depending on the document.
Stream stream = session.createStream();
doc.getMIMEEntity().getFirstChildEntity().getContentAsBytes(stream);
ByteArrayInputStream bais = new ByteArrayInputStream(stream.read());
return ImageIO.read(bais);
EDIT:
session.setConvertMime(false);
Stream stream = session.createStream();
Item itm = doc.getFirstItem("ParentEntity");
MIMEEntity me = itm.getMIMEEntity();
MIMEEntity childEntity = me.getFirstChildEntity();
childEntity.getContentAsBytes(stream);
ByteArrayOutputStream bo = new ByteArrayOutputStream();
stream.getContents(bo);
byte[] mybytearray = bo.toByteArray();
ByteArrayInputStream bais = new ByteArrayInputStream(mybytearray);
return ImageIO.read(bais);
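Once the MIME part's content is available as a byte array, the remaining steps from the question (decode it, then scale to a fixed height such as 600 or 240 px) need nothing Domino-specific. Here is a minimal JDK-only sketch of those steps. The class ImageBytes and its methods are hypothetical stand-ins; they are not the getNewWidth and getScaledInstance helpers from the question, whose bodies are not shown.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper: turns attachment bytes into a scaled BufferedImage.
public class ImageBytes {

    // Decodes encoded image bytes (JPG, PNG, ...) into a BufferedImage.
    public static BufferedImage decode(byte[] data) {
        try {
            BufferedImage img = ImageIO.read(new ByteArrayInputStream(data));
            if (img == null) {
                // ImageIO.read returns null when no registered reader recognises the data
                throw new IOException("No ImageIO reader recognised the attachment bytes");
            }
            return img;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Width that preserves the aspect ratio for a fixed target height.
    public static int newWidth(int height, int width, int targetHeight) {
        return Math.max(1, Math.round((float) width * targetHeight / height));
    }

    // Redraws the source image at the target height with a proportional width.
    public static BufferedImage scaleToHeight(BufferedImage src, int targetHeight) {
        int w = newWidth(src.getHeight(), src.getWidth(), targetHeight);
        BufferedImage dst = new BufferedImage(w, targetHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, targetHeight, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

A call such as ImageBytes.scaleToHeight(ImageBytes.decode(mybytearray), 240) would then produce the "Small" variant, assuming mybytearray holds the decoded attachment content as in the snippet above.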

David, have a look at DominoDocument: http://public.dhe.ibm.com/software/dw/lotus/Domino-Designer/JavaDocs/XPagesExtAPI/8.5.2/com/ibm/xsp/model/domino/wrapped/DominoDocument.html
With it you can wrap every Notes document.
DominoDocument has nested types such as DominoDocument.AttachmentValueHolder, through which you can access the attachments.
I explained it at Engage. It is very powerful.
http://www.slideshare.net/flinden68/engage-use-notes-objects-in-memory-and-other-useful-java-tips-for-x-pages-development

Can I tell what the file type of a BufferedImage originally was?

In my code, I have a BufferedImage that was loaded with the ImageIO class like so:
BufferedImage image = ImageIO.read(new File(filePath));
Later on, I want to save it to a byte array, but the ImageIO.write method requires me to pick either a GIF, PNG, or JPG format to write my image as (as described in the tutorial here).
I want to pick the same file type as the original image. If the image was originally a GIF, I don't want the extra overhead of saving it as a PNG. But if the image was originally a PNG, I don't want to lose translucency and such by saving it as a JPG or GIF. Is there a way that I can determine from the BufferedImage what the original file format was?
I'm aware that I could simply parse the file path when I load the image to find the extension and just save it for later, but I'd ideally like a way to do it straight from the BufferedImage.

As @JarrodRoberson says, the BufferedImage has no "format" (i.e. no file format; it does have one of several pixel formats, or pixel "layouts"). I don't know Apache Tika, but I guess his solution would also work.
However, if you prefer using only ImageIO and not adding new dependencies to your project, you could write something like:
ImageInputStream input = ImageIO.createImageInputStream(new File(filePath));
try {
Iterator<ImageReader> readers = ImageIO.getImageReaders(input);
if (readers.hasNext()) {
ImageReader reader = readers.next();
try {
reader.setInput(input);
BufferedImage image = reader.read(0); // Read the same image as ImageIO.read
// Do stuff with image...
// When done, either (1):
String format = reader.getFormatName(); // Get the format name for use later
if (!ImageIO.write(image, format, outputFileOrStream)) {
// ...handle not written
}
// (case 1 done)
// ...or (2):
ImageWriter writer = ImageIO.getImageWriter(reader); // Get best suitable writer
try {
ImageOutputStream output = ImageIO.createImageOutputStream(outputFileOrStream);
try {
writer.setOutput(output);
writer.write(image);
}
finally {
output.close();
}
}
finally {
writer.dispose();
}
// (case 2 done)
}
finally {
reader.dispose();
}
}
}
finally {
input.close();
}
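If only the format name is needed, for example to remember it alongside the BufferedImage, the same ImageIO lookup can be reduced to a small helper. This is a sketch under the assumption that the encoded bytes are already in memory; the class name FormatSniffer and the method detectFormat are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

// Hypothetical helper: asks ImageIO which reader claims the encoded bytes.
public class FormatSniffer {

    // Returns the format name of the first matching reader, or null if none claims the data.
    public static String detectFormat(byte[] encoded) throws IOException {
        try (ImageInputStream input =
                ImageIO.createImageInputStream(new ByteArrayInputStream(encoded))) {
            if (input == null) {
                return null; // no stream provider available for this input
            }
            Iterator<ImageReader> readers = ImageIO.getImageReaders(input);
            if (!readers.hasNext()) {
                return null;
            }
            ImageReader reader = readers.next();
            try {
                return reader.getFormatName();
            } finally {
                reader.dispose();
            }
        }
    }
}
```

The returned name can be passed as the format argument of ImageIO.write later on. The exact spelling and case of the name depend on the reader plugin, so compare it case-insensitively.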

BufferedImage does not have a "format"
Once the bytes have been translated into a BufferedImage, the format of the source file is completely lost; the contents represent a raw byte array of the pixel information, nothing more.
Solution
You should use the Tika library to determine the format from the original byte stream before the BufferedImage is created and not rely on file extensions which can be inaccurate.

One could encapsulate the BufferedImage and related data in class instance(s) like so:
final public class TGImage
{
public String naam;
public String filename;
public String extension;
public int layerIndex;
public Double scaleX;
public Double scaleY;
public Double rotation;
public String status;
public boolean excluded;
public BufferedImage image;
public ArrayList<String> history = new ArrayList<>(5);
public TGImage()
{
naam = "noname";
filename = "";
extension ="";
image = null;
scaleX = 0.0;
scaleY = 0.0;
rotation = 0.0;
status = "OK";
excluded = false;
layerIndex = 0;
addHistory("Created");
}
final public void addHistory(String str)
{
history.add(TGUtil.getCurrentTimeStampAsString() + " " + str);
}
}
and then use it like this:
public TGImage loadImage()
{
TGImage imgdat = new TGImage();
final JFileChooser fc = new JFileChooser();
FileNameExtensionFilter filter = new FileNameExtensionFilter("Image Files", "jpg", "png", "gif", "tif");
fc.setFileFilter(filter);
fc.setCurrentDirectory(new File(System.getProperty("user.home")));
int result = fc.showOpenDialog(this); // show file chooser
if (result == JFileChooser.APPROVE_OPTION)
{
File file = fc.getSelectedFile();
System.out.println("Selected file extension is " + TGUtil.getFileExtension(file));
if (TGUtil.isAnImageFile(file))
{
//System.out.println("This is an Image File.");
try
{
imgdat.image = ImageIO.read(file);
imgdat.filename = file.getName();
imgdat.extension = TGUtil.getFileExtension(file);
info("image has been loaded from file:" + imgdat.filename);
} catch (IOException ex)
{
Logger.getLogger(TGImgPanel.class.getName()).log(Level.SEVERE, null, ex);
imgdat.image = null;
info("File not loaded IOexception: img is null");
}
} else
{
imgdat = null;
info("File not loaded: The requested file is not an image File.");
}
}
return imgdat;
}
Then you have everything relevant together in TGImage instance(s), and can perhaps use them in an image list like so:
ArrayList<TGImage> images = new ArrayList<>(5);

Get Image from the document using Apache POI

I am using Apache Poi to read images from docx.
Here is my code:
public Image ReadImg(int imageid) throws IOException {
XWPFDocument doc = new XWPFDocument(new FileInputStream("import.docx"));
BufferedImage jpg = null;
List<XWPFPictureData> pic = doc.getAllPictures();
XWPFPictureData pict = pic.get(imageid);
String extract = pict.suggestFileExtension();
byte[] data = pict.getData();
//try to read image data using javax.imageio.* (JDK 1.4+)
jpg = ImageIO.read(new ByteArrayInputStream(data));
return jpg;
}
It reads the images properly, but not in order.
For example, if the document contains
image1.jpeg
image2.jpeg
image3.jpeg
image4.jpeg
image5.jpeg
It reads
image4
image3
image1
image5
image2
Could you please help me resolve this?
I want to read the images in document order.
Thanks,
Sithik

public static void extractImages(XWPFDocument docx) {
try {
List<XWPFPictureData> piclist = docx.getAllPictures();
// traverse through the list and write each image to a file
Iterator<XWPFPictureData> iterator = piclist.iterator();
int i = 0;
while (iterator.hasNext()) {
XWPFPictureData pic = iterator.next();
byte[] bytepic = pic.getData();
BufferedImage imag = ImageIO.read(new ByteArrayInputStream(bytepic));
ImageIO.write(imag, "jpg", new File("D:/imagefromword/" + pic.getFileName()));
i++;
}
} catch (Exception e) {
System.exit(-1);
}
}

Using ImageIO to write JPEG 2000 with layers (i.e. decomposition levels)

Ok, here is our issue:
We are trying to convert a series of black and white .tiff files into jpeg2000 .jpf files, using imageio. We are always getting viewable .jpf files, but they usually do not have the specified number of layers or decomposition levels for zooming.
Here is our code:
//Get the tiff reader
Iterator<ImageReader> readerIterator = ImageIO.getImageReadersByFormatName("tiff");
ImageReader tiffreader = readerIterator.next();
//make an ImageInputStream from our tiff file and have the tiff reader read it
ImageInputStream iis = ImageIO.createImageInputStream(itemFile);
tiffreader.setInput(iis);
//just pass empty params to the tiff reader
ImageReadParam tparam;
tparam = new TIFFImageReadParam();
IIOImage img = tiffreader.readAll(0, tparam);
//set up target file
File f = new File(itemTargetDirectory.getAbsolutePath() + "/" + destFileName);
//we have tried FILTER_97 as well as different ProgressionTypes and compression settings
J2KImageWriteParam param;
param = new J2KImageWriteParam();
param.setProgressionType("layer");
param.setFilter(J2KImageWriteParam.FILTER_53);
//Our problem is that this param is not always respected in the resulting .jpf
param.setNumDecompositionLevels(5);
//get the JPEG 2000 writer
Iterator<ImageWriter> writerIterator = ImageIO.getImageWritersByFormatName("JPEG 2000");
J2KImageWriter jp2kwriter = null;
jp2kwriter = (J2KImageWriter) writerIterator.next();
//write the jpf file
ImageOutputStream ios = ImageIO.createImageOutputStream(f);
jp2kwriter.setOutput(ios);
jp2kwriter.write(null, img, param);
It has been an odd experience, as the same code has behaved differently on subsequent runs.
Any insights will be appreciated!

Do all the TIFF files have the same settings (color model)? J2KImageWriter.java shows the decomposition levels getting set (forced) to zero when indexed-color or multi-pixel packed source images are used as input.
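To check that condition on a given input, the color model and sample model of the decoded image can be inspected before it is handed to the writer. The sketch below uses only the JDK; the class J2KInputCheck and its methods are hypothetical, and converting to 8-bit grayscale is just one possible workaround. Whether it restores the requested decomposition levels depends on the J2KImageWriter behavior described above and is not verified here.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;
import java.awt.image.MultiPixelPackedSampleModel;

// Hypothetical helper: detects indexed-color or packed images and converts them.
public class J2KInputCheck {

    // True for indexed-color or multi-pixel packed layouts (e.g. 1-bit black and white).
    public static boolean isIndexedOrPacked(BufferedImage img) {
        return img.getColorModel() instanceof IndexColorModel
                || img.getSampleModel() instanceof MultiPixelPackedSampleModel;
    }

    // Redraws such an image as 8-bit grayscale; other images are returned unchanged.
    public static BufferedImage toGrayIfNeeded(BufferedImage img) {
        if (!isIndexedOrPacked(img)) {
            return img;
        }
        BufferedImage gray = new BufferedImage(
                img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(img, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

Black and white TIFF files are often stored as 1-bit images, which is exactly the packed layout this check flags, so it could explain why some files in a batch behave differently from others.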

Drew was on the right track, and here is the code that ended up sorting things out for us:
public void compressor(String inputFile, String outputFile) throws IOException {
J2KImageWriteParam iwp = new J2KImageWriteParam();
FileInputStream fis = new FileInputStream(new File(inputFile));
BufferedImage image = ImageIO.read(fis);
fis.close();
if (image == null)
{
System.out.println("If no registered ImageReader claims to be able to read the resulting stream");
}
Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("JPEG2000");
String name = null;
ImageWriter writer = null;
// Compare the class name with equals(); != only compares object references
while (!"com.sun.media.imageioimpl.plugins.jpeg2000.J2KImageWriter".equals(name)) {
writer = writers.next();
name = writer.getClass().getName();
System.out.println(name);
}
File f = new File(outputFile);
long s = System.currentTimeMillis();
ImageOutputStream ios = ImageIO.createImageOutputStream(f);
writer.setOutput(ios);
J2KImageWriteParam param = (J2KImageWriteParam) writer.getDefaultWriteParam();
IIOImage ioimage = new IIOImage(image, null, null);
param.setSOP(true);
param.setWriteCodeStreamOnly(true);
param.setProgressionType("layer");
param.setLossless(false);
param.setCompressionMode(J2KImageWriteParam.MODE_EXPLICIT);
param.setCompressionType("JPEG2000");
param.setCompressionQuality(0.1f);
param.setEncodingRate(1.01);
param.setFilter(J2KImageWriteParam.FILTER_97);
writer.write(null, ioimage, param);
System.out.println(System.currentTimeMillis() - s);
writer.dispose();
ios.flush();
ios.close();
image.flush();
}
