Converting a PDF into multiple JPGs with iText or other - java

I have the need to convert any multipage PDF file into a set of JPGs.
Since the PDF files are supposed to come from a scanner, we can assume each page just contains a graphic object to extract, but I cannot be 100% sure of that.
So, I need to convert any renderable content from each page into a single JPEG file.
How can I do this with iText?
If I can't do this with iText, what Java library can achieve this?
Thanks.

Ghostscript (available for Windows, Linux, MacOS X, Solaris, AIX,...) can convert...
...from input formats: PDF, PostScript, EPS and AI
...into output formats: JPEG, TIFF, PNG, PNM, PPM, BMP, (and more).
(The ImageMagick mentioned above doesn't do the conversion on its own -- it uses Ghostscript under the hood, as do many other tools.)

ICEpdf - http://www.icepdf.org/ - has an open source entry version which should do what you need.
I believe the primary difference between the open source version and the pay-for version is that the pay-for has much better font support.

You can also use Sun's PDF-Renderer and JPedal does PDF to image (low and high res.

With Apache PDFBox you could do the following:
PDDocument document = PDDocument.load(pdffile);
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = pages.get(i);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 72);
ImageIO.write(image, "jpg", new File(pdffile.getAbsolutePath() + "_" + i + ".jpg"));
}

Related

add Inline image in pdf document using itext library

we have created an application to generate pdf documents using itext 5 library. As the part of pdf generation, we tried to embed an image inline in pdf which should be non editable and read only. We tried with an addImage method of PdfContentByte as below,
byte[] decoded = Base64.getDecoder().decode(encodedImage);
image = Image.getInstance(decoded);
After this image is retrieved, used the same in addImage method.
PdfContentByte canvas = pdfStamper.getOverContent(item.getPage(0));
canvas.addImage(image, Boolean.TRUE);
Findings : Since the image is in Base 64 string format, the image is not displayed in the resultant pdf document (if the image is not in Base 64 format, it is working fine).
when we open the pdf , the below error is shown :-------> "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem."
how can we handle this situation ?.
Is any other way to achieve this requirement. Please help
Code :-
PdfReader resultantPdfReader = new PdfReader("template.pdf");
PdfStamper resultandPdfStamper = new PdfStamper(resultantPdfReader, new FileOutputStream("A13.pdf"));
AcroFields acroFields = resultantPdfReader.getAcroFields();
Rectangle fieldPosRec = acroFields.getFieldPositions("imageField").get(0).position;
String encodedSignature = "";
encodedSignature = new String(Files.readAllBytes(Paths.get("MyImage.png")));
if(encodedSignature.indexOf("data:image/png;base64,") != -1) {
encodedSignature = encodedSignature.substring("data:image/png;base64,".length());
}
Image image = null;
try {
byte[] decoded = Base64.getDecoder().decode(encodedSignature);
image = Image.getInstance(decoded);
} catch (Exception e) {
e.printStackTrace();
}
image.scaleAbsoluteHeight(fieldPosRec.getHeight());
image.scaleAbsoluteWidth(fieldPosRec.getWidth());
image.scaleToFit(fieldPosRec);
acroFields.removeField("imageField");
image.setAbsolutePosition(fieldPosRec.getLeft(), fieldPosRec.getBottom());
PdfContentByte canvas = resultandPdfStamper.getOverContent(item.getPage(0));
canvas.addImage(image, Boolean.TRUE);
resultandPdfStamper.close();
resultantPdfReader.close();
Base64 Image String : iVBORw0KGgoAAAANSUhEUgAAAU4AAACWCAYAAACxSWGfAAAAAXNSR0IArs4c6QAAFJFJREFUeAHtnQmwZFV9hw37DgoiaJgZsGBAFgkSIEJgjAZRAliFikrEoOgoGkQlEFJKiXFhr5RCAKMjJhBETAoRImjAB4KAKBgEDBBhQM2IyEDYh2XM90Gf8k7T3a/fe9093X1//6rv3XOXvsvX9/773HNP93vBCxIxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAMxEAPNBl7MhEObJ2Y8BmIgBmKgtYG9mTwBv2vAIBEDMRADMdDKwCuYeCmUhFmGrZbNtBiIgRhoa2AN5uwDZ8L3oCST6Q6vYR3nwnyYC8MQ67MTp8JT4HEthsMbZccTMRADMTCpgU1Z4kPwbXgCppsku3ndIta/vBLpymz7w2CidF9NnCZQE6lR9v+5sfyNgRiIgYqBlSjvASfALVAShsNn4Br4OPwRzDSsZc4Hk+UiqG7LstOcNx/mQj9iM1Z6LFwNZfveonurXmJVCmVemZZhDMRAzQ1swPG/E74GD0BJEg4dPw8OAp8u9zNMjibJc2ERVPfDstOcNx/mwnRjLV54MFwBS6Fsx3X7MKg5jmeCyxzXPCPjnQ38QefZmRsDI2dge/bYJCE7wwpQ4mcULm5wFcOny4wBD02O8yps1LR9b6sfhvsa/GaS8k7M/yvYH9YE41H4NzgLJsAEWQ3dWBs1Xg0/fLaUPzEQA7UysAdH+2UotSyHtl1eAn8Nm8Kwhol0PlgzvA6qxzCVsrVMa5sHg7XPdrEqM24F153aZjtLHaanxtlBTmaNhAFvs0+EdzX29iaGJp+L4DKw5jVqYa1xQ/DYpF3ZeS77A/gJfBXuhMnCW/QjwRq4bbpLIBEDMVADA37ovwfuB2tOj8MnYBVItDfgLbpNFLJT+8UyJwZiYNwMbM0BfR/KbaxPi18+bgfZ4+Pxg+bNcCPkFr3HcrO63hlYl1VtA3vBe+FYsA3uO3Ar/Aouh9PhjbAaJDobWIPZn4MnwYt/EbwNEp0NvI7Z10P5oDmasu2ciRgYGgOvYk9OgnKSdju0Le6bcAhsDIllDezN6F2gz2fgNPDDKdHewI7M+i6Uc9AP6/eBfVoTMTAUBjwZj4HylTbb3qxZWsNcANY4rXlaA7Um6kW/A/gaawNLoZzglp3mvF50yGY1IxsvY8+/AcXNDZT/eGSPZjA7vgWb+TqUc+oBykfB6pCIgaExsCV7Yj84L25P1lNgqrfe1jJNrNY6rX2WROHwHNgX6hQrcrAfhodABw79brXTE60NvJTJZ0L58H6M8nHwQkjEwNAYsMH9MPAE9eJeCPNgpmHS9db0DLCGVZLoBGVvv8Y9rFFWj9uO3NY8E60NmBhNkOU8NHGaQE2kiRgYKgObsDeXQUlqCyiv04c9tAngQ3AfuC1rtGfDLBi3sPniVLAN02O9C/wASbQ24K23t+CLoZwb3qJ7q56IgaEzcBB79CB4st4L+0K/w6RireJxcLsOfcLcj2TNagceB7DF/wWPzafmHqtP0RPPN+CHqQ95fNijL/ku1OFuhMNMjJqBF7PD/w7lZLXstEHGbDZmm6c1T/fDmugHwYtpFMP+l5dCcXoVZR+eJZY1YLPQLnAyXAfF1/WUXweJGBhKA9YqrV16wlrbtNa5PMPaxRVQLqD/przf8tyhKW57FZb3Z9xKDdpeCIeACSLxnIFqsrybSeW9dngevAXiCwmJ4TPgrfACKCet7Zq2bw5LmCxvg7J/E5SH/ZZtJ/bRC7/s81cpD7rmziaHMjoly3vY41PAmmcS5lC+fdkpDcyDheAF7hPLw2AYT1hv00fhAZLfVrHt0u9J6/QCmAd1jyTLup8BY3L8q3EcfrKXdkT7aM4dgWNr9QDJRGWteXmHtcxbwYRp4jweTKR1jSTLur7zY3rca3FcJ4IXuP3hjgFrdKMUs9jZYXmA1FzLNHnuPEoye7iv1WTpbbfnWCG34T0UnVUN3sD5jZN5gqHfOx/laH6AdCUHczRsNKCDSi3zuaad8jQ8yXJAJ142M1gDR7I5awA+Nd98sJvu69Z8gGTH6FK7sSbtt3H2hH602da1lrk6Pv2wejf8A1wOd0Dx7tDkaZeiPOBBQmL0DbyWQ7DtzXbNfUf/cJ53BCbI14MJ08RZLuY7KfeyFlqHWqYuNwU/kD4B3qXYs6F846m4LUMTaJIlEhLjZWAWh3MfeKJ/arwOreXReKv+d2DSLBf3TGuh41rL9MHarvABOB2uhvLDI8VdGfqNp5+C7ctHwRvgZZCIgbEz4BP068GT/z9gBahLdKqFmli7bQsdh1qmv7o0F+xU/vfgr1TdBSUpNg/9euglcCK8E14JdupPxEAtDHyZo/Si+DnU+ee3plMLHdVa5ga8138Gh8MC+BE8Bs3J0XG/2eR8l3N5X+frEzFQWwPv58i9OB6F7WprYdkDb1cLbZVUyrRh7ZdpDdD39S/hBLCGaE2x7HfzcCHzLoRPw1thS7AmmoiBGGgY2IXhEvDieUdjWgbLGii1UB8oNSeZMm5H9mHol2lb4l5wJJwNN4FtjmU/q0PbKG2rtM3yUNgNbMtMxEBXBqxd1DFewkH/GLzY7DbyEUiMhgFrkVuBbYoF30drh82xlAn/AybRKgsZN5EmYmBaBuqYOP0W0GWwO1wJpRsSxcSQGdiQ/fFWuyRIhybNlaE5bmeCt+HVBHkL47ZdJmKgpwZG7auEvTj4k1iJSfNXYDuW7XOJ5WvA83ALqCZIyxu32C1rkfaZ/K8mftli2UyKgb4YqFuN07ZM+9jZ9rUHXAuJwRqw50JzLXIbpvmEvjlsi7QGWU2SNzOeWmSzqYwP1ECdapxerP/UsHsYwyTN/p5q9od9OTTXIme12KztjXdCNUFaXghpi0RCYrgM1KXGaS3HvnibwQJ4DyR6Z2BtVrUtVJOk42u22IS1Rb9dU02S1iofbrFsJsXAUBqoQ+K05nMR+PU3k6ddT+yGlJiegTm8rJogLfuB1Opc+gXTqwnSsk+5badMxMDIGqjDrfoneXdMmr+F/SFJEwldxOosY9tjNUna3LFui9fq1CfYzUnygRbLZlIMjLyBVrWEkT+oygHsS/kCsIazJ1wOiecbsB9kNUFa9im3tfXm+DUTmhOkT7nTO6HZVMZjYAQNeOE/CD5c+JsR3P9+77Lft9bLhaCjZux5YNvjv8AR8OewISRiIAbG1IA16c+DycAf8E383sCrKf4zPAElWfoDJ34p4BR4F2wPq0AiBmKgRgZ251hNCn6tcq0aHXe7Q9XB++AnUJLlM5R9aPZGaHVLzuREDMRAnQycycGaID5Tp4NucaxbM+0L8H9QEua9lD8LsyERAzEQA88a8BZzMZgoXvHslHr9WZnDPQCugJIsHX4f3g65BUdCIgZiYFkDb2LURHHjspPHfmwWR2gN26feJWE+RPk0sFtRIgZiIAbaGjifOSaOj7VdYnxm2Da5F/hk3DbLkjDtLvR+SPsuEhIxEAOdDazL7MfBJGLfxHENuxIdCT+Hkix9Sn427AqJGIiBGOjawLtZ0kRyWdevGK0F7Upkv8pqV6I7GT8KTKaJGIiBGJiyAROmifPgKb9yeF/g7fZ8aO5K9C2mpSvR8L5v2bMYGAkD3pp7i+6t+jojscedd9KuRKdCuhJ19pS5MRADMzBwBK+1tjnK3xSyq9DboLkr0ZVMS1ciJCRiIAZ6a6Dcyu7X29UOZG2z2YpdieycbvKX0pXImmciBmIgBnpuwORisrkfRqWDt12J/Lm7Vl2JbNNMVyIkJGIgBvpn4LOs2sR5Rv820dM1b8La/EGNUrssXYl8ap6IgRiIgYEYuIitmIT+dCBbm9lGtuPl/kdG9/eHYH/MdCVCQiIGYmCwBi5hcyaiXQa72Slv7bW8ojwln6C83pTXkBfMxIA/N+g5cjJ4ziRioNYGvBBMnA6HNQ5kx5aA+/k1aPXvcJmc6LGBarK8m3Xrv7BVj7eV1cXASBmwFuHFcA94oQxb+M2epeA+ngTDuI/s1thEp2TpOWL7sudM3oexectzINMx4AXgBWFi8oIYlvDJub9O5H7ZOf8wSPTHQDVZlnNB75Jk2R/nWesYGBi22/XVcXoBeOH6bSb/w2aitwaSLHvrM2uroYFyu/4bjv1zsM1ydOBT8mvApHk/5BeLkNCDMFH6vn4QzoPypYdqzdIP0NyGIyERA90Y8KLaB8pF5PAm+FuYDYOKzdjQ7eD274ItITE9AyvyslfBR8Haux9Ceq0ywXiSJRISMTBdAyZP+3KeDr+FcoH5YMZ/HfEB6GefyR1Z/73gdm+AjSDRvQG/9eWXAI6Gb4NfOy3vYRnezTR/Wu+9MBcSMRADPTTg/975C/hXeBTKhfcU5YvgHbAm9Cr8ibdHwO1cAmtBorMB24FfA5+Ey+ExKO9TGd7BtC/BQTAHEjEQAwMyYIK0H+XFYOIsF6WJ7hzYG0y0041DeOHT4Hq/AitB4vkG1mGS/+rDNuir4Uko74VD7wxuBnsiHAAbQyIGYmAIDGzAPhwKV4EXarlwvbX/R9gNvOXvNo5lwbKOT3X7opostz7H6a9V2W/yR1A+XIovx3/cmP8mhi6fiIGhNzCVBDH0BzONHZzNa7xll20qr19E+T5Y2IYHmG6t8otwMNhH03+O9iWoc1hDtI15D9gdtobqOWZt3wR6BVwJ1jptx0zEwEgZqJ7UI7XjfdjZbVmnCXQH2HOS9Xux625tMBmcBT7MWNjAxFqH8IPHBFkS5eZNB/0449dBSZTXUrYdMxEDI20gibP12/ciJs/pgAmzU5hYF3ZgVBPrFhxTNVHOYrwaDzPyAyiJ8nrKT1YXSDkGxsFAEufU3kX7ZFqznAN+je848MnwnArWwnwI0ikmS6zOXxVWaQy7LXe7XLv1dnq9TRPOr8ZiRmwrLonyRso2WyRiYKwNJHF2//buyqIXwovAW859wAdKrcJl5rShm8Taap3DMM3fD70bSqL0CfjvhmHHsg8xMEgDSZzd2d6fxc6G1eCb8Haw/W660SmxrstK7ThvzW1JA293S9lhdXyq5Zm83vbcRAzEQAxMauAwljCJWbOyf+EKkIiBGIiBGGhhwNr4iWDCXAp+5z0RAzEQAzHQxoAPT84Fk6a3wgdCIgZiIAZioI2B9Zg+ASZN/z+Q/ycoEQMxEAMx0MaAXYu+ACbNX8J2kIiBGIiBGOhgwO+qmzS/A5t0WC6zYiAGYiAGMGC/TJPmE/BKSMRADMRADHQwYL9Jf9TDxHl4h+UyKwZiIAZiAAN2O7oUTJoO86UAJCRiIAZioJOBjzLTpOk/ebPmmYiBGIiBGOhgwLZM2zRNnLZxJmIgBmIgBjoYsOvRLWDS9KuUiRiIgRiIgUkMlK5HJk+TaCIGYiAGYqCDgXQ96iAns2IgBmKg2UC6HjUbyXgMxEAMdDCQrkcd5GRWDMRADLQy8BEmputRKzOZFgMxEAMtDKTrUQspmRQDMRAD7Qyk61E7M5keAzEQA20MpOtRGzGZHAMxEAOtDKTrUSsrmRYDMRADbQxUux75YCgRA3U2sBMHf1ydBeTYJzeQrkeTO8oS9TCwKodpwnwa8rsM9XjPp32U6Xo0bXV54RgZsJZ5K5gwTZzHg4k0MQMD4/rbk3Y9ug48QfaFb0FitAz4n0bXqrB2pTzZ9FbLei74z/fE/yf1i6Zhmeb8cQiP91g4AlaEn8HB4HWRmKGBlWb4+mF8uR8G88ETx6fpSZpI6HN0k+RaJTMTYKvpTlu5D/u8Jut8KWzVYd2PMK9dUh325DqXfZ/XYHuGW8IzcAIcA0sg0QMD41jj3AUv18CNsCs8DompG/BDdUPwAVth40q5TDPJvQR6HU+xQpPYw42h5UKrad0suwbr+MMG/jM+y81Dk/lk4baGJblamzwQ9oa3QjXOYOQsSC2zaqUH5XGscb6l4eV7DJM0n3+SvJBJJek1D6uJcQOW6/aD9VGWfRJaJbRupzUnPtfX63CdD8LNHVa8LvOak2nzuMnVWmunmut/Mv8mOB9MXLYx9ipKspzHCr39LuFPJP4UJhrcxjDRBwPdXhh92HRfVunx3A2e6H8C10IdYjUO0lpfNfE1J8UybhNGN/EMC/kP7H7dhkWV6Q9RrlN0Sq7WYtcHa+sl7qHwDZhpEi0J05qlNcwSX6EwAeeA71uizwbGLXHugi9v072Nmg29/JRndQONFdiatb6S8By2S4zrTWHPTHImw2ria5Uc/R9MS6ew3iz6ewNeVzuDCe7N4Ad5Cc9NE+hUkmhJmB/ndZs3VnQxw69DkmVDyCAH45Y4T0ae/3ztFPjYIEVOYVu2CVaTYbVcTYzWWLxguglvQe+FagJslxjTfNGN0d4tM90kOpddmNdgW4Zbg3EHfBqSMLWxnGLcEucleHw9nAufh+th1SZ8Atw8rTrez/m2KbutbsLa8v1QTYbV8qLKvMXdrDDLLHcDJYnaDi/VmuhpjN8FO8IWsANUw9qltcwkzKqV5VQet8RpY/2ty8llt5v1FvgRqCa+akIsZWuQPllOjKeBuRzWfPB23vbp5ge1NzDtdphocBvDxJAYGLfEWf1E99bmNfA0LAFvZx22Y1Dzkwx5E2oYJsp5sB/sBjbZNMdFTJAJSKJEQiIGYqC+Bmx2aYVPwK1VngTbQSIGYiAGYqBhoJo0TZQ+Ud8f1mnMzyAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGYiAGRs3A/wMpp3D77pc5JQAAAABJRU5ErkJggg==
Your example image has properties that iText cannot properly translate into an inlined image. Unfortunately it does not recognize this and outputs an erroneous result PDF.
In particular your image file uses transparency. Inline images don't allow for Mask or SMask entries (which images in PDFs use to represent transparency). Thus, your image as is cannot be used as inline image.
As a result the inline image created by iText only consists of a black rectangle while the transparency information (which contains the line drawing) is dropped.
Furthermore, your image uses a calibrated RGB color space. Such calibrated RGB color spaces cannot be inlined themselves, so the color space definition has to be put into the page resources. iText, though, when creating an inlined image, fails to reference that non-inlined part properly.
As a result the inline image created by iText references a color space by the wrong name, causing the "An error exists on this page" error message by Adobe Reader. Fixing that reference one gets a valid result PDF showing the black rectangle mentioned above.
In a comment you explain that your actual objective is to prevent copying image from the generated pdf document.
In general this obviously is not possible - any information a PDF viewer can access to draw on screen or paper also can be accessed by some PDF processor designed for that task to copy to some file. (Let's ignore proprietary, viewer specific DRM extensions here.)
What you can try, though, is draw your data in such a way that the common PDF viewers don't offer to copy it. Trying to use an inline image as you did is one approach in that direction. Other approaches wrap the image in other structures, e.g. in a pattern:
PdfContentByte canvas = resultandPdfStamper.getOverContent(1);
Rectangle pageSize = resultantPdfReader.getPageSize(1);
PdfPatternPainter painter = canvas.createPattern(pageSize.getWidth(), pageSize.getHeight());
painter.addImage(image);
canvas.setColorFill(new PatternColor(painter));
canvas.rectangle(0, 0, pageSize.getWidth(), pageSize.getHeight());
canvas.fill();
(AddImageInPattern test testAddToPageTest3)
Adobe Acrobat Reader here does not offer to copy that image. Also my (admittedly older) Adobe Acrobat Pro does not offer to copy it, merely to remove it (more exactly, remove the whole rectangle filled with the pattern).
Beware, though, what the common PDF viewers do or don't offer is a moving target...

How to detect OCR in a scanned Document with pdfbox 2.0.0?

The Problem: I have a large folder with many subfolders with many pdfs in them. Some of them already have OCR on them. Some of them don't. So i wanted to write a Java Program to filter the non OCR PDFs out and copy them to a hot folder.
I tested like 20 Documents and what they all have in common is, that if you open them with editor, you can find the word 'font' and the OCR ones and you cant find it in the non OCR ones. My Question now is: How do i implement this check using PDFbox 2.0.0 ? All the solutions i found dont seem to work only with older versions. And I'm not capable of finding a solution in the documentation. (which is clearly my fault)
Thanks in advance.
Here's how to find out if fonts are on the top level of a page:
PDDocument doc = PDDocument.load(new File(...));
PDPage page = doc.getPage(0); // 0 based
PDResources resources = page.getResources();
for (COSName fontName : resources.getFontNames())
{
System.out.println(fontName.getName());
}
doc.close();
Re: mkl suggestion, here's how to extract text:
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1); // 1 based
stripper.setEndPage(1);
String extractedText = stripper.getText(doc);
System.out.println(extractedText);

java - generate unicode pdf with Apache PDFBox

I have to generate pdf in my spring mvc application. recently I tested iTextPdf library, but i could not generate unicode pdf document. in fact I didn't see non-latin characters in the generated document. I decided to use Apache PDFBox for my purpose, but I don't know has it support unicode characters? If has, is there any good tutorial for learning pdfBox? And If not, which library should I use?
Thanks in advance.
The 1.8.* versions don't support PDF generation with Unicode, but the 2.0.* versions do. This is the example EmbeddedFonts.java:
public class EmbeddedFonts
{
public static void main(String[] args) throws IOException
{
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
String dir = "../pdfbox/src/main/resources/org/apache/pdfbox/resources/ttf/";
PDType0Font font = PDType0Font.load(document, new File(dir + "LiberationSans-Regular.ttf"));
PDPageContentStream stream = new PDPageContentStream(document, page);
stream.beginText();
stream.setFont(font, 12);
stream.setLeading(12 * 1.2);
stream.newLineAtOffset(50, 600);
stream.showText("PDFBox Unicode with Embedded TrueType Font");
stream.newLine();
stream.showText("Supports full Unicode text ?");
stream.newLine();
stream.showText("English русский язык Tiếng Việt");
stream.endText();
stream.close();
document.save("example.pdf");
document.close();
}
}
Note that unlike iText, PDFBox support for PDF creation is very low level, i.e. we don't support paragraphs or tables out of the box. There is no tutorial, but a lot of examples. The API orients itself on the PDF specification.
The current version of Apache PDFBox can't deal with Unicode, see:
https://pdfbox.apache.org/ideas.html
iTextPdf v. 5.x generates pdf files with Unicode. There is an exemple here:
iText in Action: Chapter 11: Choosing the right font
part3.chapter11.UnicodeExample
http://itextpdf.com/examples/iia.php?id=199
To run it, you just need to adapt the value of EncodingExample.FONT and to add some code to create the output file.

OpenType font kerning with itext

I am using itext and ColdFusion (java) to write text strings to a PDF document. I have both trueType and openType fonts that I need to use. Truetype fonts seem to be working correctly, but the kerning is not being used for any font file ending in .otf. The code below writes "Line 1 of Text" in Airstream (OpenType) but the kerning between "T" and "e" is missing. When the same font is used in other programs, it has kerning. I downloaded a newer version of itext also, but the kerning still did not work. Does anyone know how to get kerning to work with otf fonts in itext?
<cfscript>
pdfContentByte = createObject("java","com.lowagie.text.pdf.PdfContentByte");
BaseFont= createObject("java","com.lowagie.text.pdf.BaseFont");
bf = BaseFont.createFont("c:\windows\fonts\AirstreamITCStd.otf", "" , BaseFont.EMBEDDED);
document = createobject("java","com.lowagie.text.Document").init();
fileOutput = createObject("java","java.io.FileOutputStream").init("c:\inetpub\test.pdf");
writer = createobject("java","com.lowagie.text.pdf.PdfWriter").getInstance(document,fileOutput);
document.open();
cb = writer.getDirectContent();
cb.beginText();
cb.setFontAndSize(bf, 72);
cb.showTextAlignedKerned(PdfContentByte.ALIGN_LEFT,"Line 1 of Text",0,72,0);
cb.endText();
document.close();
bf.hasKernPairs(); //returns NO
bf.getClass().getName(); //returns "com.lowagie.text.pdf.TrueTypeFont"
</cfscript>
according the socalled spec: http://www.microsoft.com/typography/otspec/kern.htm
OpenType™ fonts containing CFF outlines are not supported by the 'kern' table and must use the 'GPOS' OpenType Layout table.
I checked out the source, IText implementation only check the kern for truetype font, not read GPOS table at all, so the internal kernings must be empty, and the hasKernPairs must return false.
So, there have 2 way to solove:
get rid of the otf you used:)
patch the truetypefont by reading the GPosition table
wait for me, I'm processing the cff content, but PDF is optional of ever of my:) but not exclude the possibility:)
Have a look at this thread about How to use Open Type Fonts in Java.
Here is stated that otf is not supported by java (not even with iText). Otf support depends on sdk version and OS.
Alternatively you could use FontForge which converts otf to ttf.

A good library for converting PDF to TIFF? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Improve this question
I need a Java library to convert PDFs to TIFF images. The PDFs are faxes, and I will be converting to TIFF so that I can then do barcode recognition on the image. Can anyone recommend a good free open source library for conversion from PDF to TIFF?
I can't recommend any code library, but it's easy to use GhostScript to convert PDF into bitmap formats. I've personally used the script below (which also uses the netpbm utilties) to convert the first page of a PDF into a JPEG thumbnail:
#!/bin/sh
/opt/local/bin/gs -q -dLastPage=1 -dNOPAUSE -dBATCH -dSAFER -r300 \
-sDEVICE=pnmraw -sOutputFile=- $* |
pnmcrop |
pnmscale -width 240 |
cjpeg
You can use -sDEVICE=tiff... to get direct TIFF output in various TIFF sub-formats from GhostScript.
Disclaimer: I work for Atalasoft
We have an SDK that can convert PDF to TIFF. The rendering is powered by Foxit software which makes a very powerful and efficient PDF renderer.
we here also doing conversion PDF -> G3 tiffs with high and low res. From my experience the best tool you can have is Adobe PDF SDK, the only problem with it is its insane price. So we don't use it.
what works fine for us is ghostscript, last versions are pretty much robust and do render correctly majority of the pdfs. And we have quite a few of them coming during the day. In production conversion is done using the gsdll32.dll; but if you want to try it use the following command line:
gswin32c -dNOPAUSE -dBATCH -dMaxStripSize=8192 -sDEVICE=tiffg3 -r204x196 -dDITHERPPI=200 -sOutputFile=test.tif prefix.ps test.pdf
it would convert your PDF into the high res G3 TIFF. and prefix.ps code is here:
<< currentpagedevice /InputAttributes get
0 1 2 index length 1 sub {1 index exch undef } for
/InputAttributes exch dup 0 <</PageSize [0 0 612 1728]>> put
/Policies << /PageSize 3 >> >> setpagedevice
another thing about this sdk is that it's open source; you're getting both c and ps (postscript) source code for it. Also if you're going with another tool check what kind of an engine they have to power the pdf rendering, it could happen they are using gs for it; like for instance LeadTools does.
hope this helps, regards
You can use the icepdf library (Apache 2.0 License).
They even provide this exact use case as one of their example source code:
http://wiki.icesoft.org/display/PDF/Multi-page+Tiff+Capture
Maybe it is not neccessary to convert the PDF into TIFF. The fax will most likely be an embedded image in the PDF, so you could just extract these images again. That should be possible with the already mentioned iText library.
I don't know if this is easier than the other approach.
Take a look at Apache PDFBox - A Java PDF Library
No Itext can not convert PDFs to Tiff.
However, there are commercial libraries that can do that. jPDFImages is a 100% java library that can convert PDF to images in TIFF, JPEG or PNG formats (and maybe JBIG? I am not sure). It can also do the reverse, create PDF from images. It starts at $300 for a server.
Here is a good article and wrapper classes for using GhostScript with C# .NET...ended up using this in production
http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx
I have some great experience with iText (now, I'm using 5.0.6 version) and this is the code for tiff convertion into pdf:
private static String convertTiff2Pdf(String tiff) {
// target path PDF
String pdf = null;
try {
pdf = tiff.substring(0, tiff.lastIndexOf('.') + 1) + "pdf";
// New document A4 standard (LETTER)
Document document = new Document(PageSize.LETTER, 0, 0, 0, 0);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(pdf));
int pages = 0;
document.open();
PdfContentByte cb = writer.getDirectContent();
RandomAccessFileOrArray ra = null;
int comps = 0;
ra = new RandomAccessFileOrArray(tiff);
comps = TiffImage.getNumberOfPages(ra);
// Convertion statement
for (int c = 0; c < comps; ++c) {
Image img = TiffImage.getTiffImage(ra, c + 1);
if (img != null) {
System.out.println("page " + (c + 1));
img.scalePercent(7200f / img.getDpiX(), 7200f / img.getDpiY());
document.setPageSize(new Rectangle(img.getScaledWidth(), img.getScaledHeight()));
img.setAbsolutePosition(0, 0);
cb.addImage(img);
document.newPage();
++pages;
}
}
ra.close();
document.close();
} catch (Exception e) {
logger.error("Convert fail");
logger.debug("", e);
pdf = null;
}
logger.debug("[" + tiff + "] -> [" + pdf + "] OK");
return pdf;
}

Categories