A good library for converting PDF to TIFF? [closed] - java

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Improve this question
I need a Java library to convert PDFs to TIFF images. The PDFs are faxes, and I will be converting to TIFF so that I can then do barcode recognition on the image. Can anyone recommend a good free open source library for conversion from PDF to TIFF?

I can't recommend any code library, but it's easy to use GhostScript to convert PDF into bitmap formats. I've personally used the script below (which also uses the netpbm utilties) to convert the first page of a PDF into a JPEG thumbnail:
#!/bin/sh
/opt/local/bin/gs -q -dLastPage=1 -dNOPAUSE -dBATCH -dSAFER -r300 \
-sDEVICE=pnmraw -sOutputFile=- $* |
pnmcrop |
pnmscale -width 240 |
cjpeg
You can use -sDEVICE=tiff... to get direct TIFF output in various TIFF sub-formats from GhostScript.

Disclaimer: I work for Atalasoft
We have an SDK that can convert PDF to TIFF. The rendering is powered by Foxit software which makes a very powerful and efficient PDF renderer.

we here also doing conversion PDF -> G3 tiffs with high and low res. From my experience the best tool you can have is Adobe PDF SDK, the only problem with it is its insane price. So we don't use it.
what works fine for us is ghostscript, last versions are pretty much robust and do render correctly majority of the pdfs. And we have quite a few of them coming during the day. In production conversion is done using the gsdll32.dll; but if you want to try it use the following command line:
gswin32c -dNOPAUSE -dBATCH -dMaxStripSize=8192 -sDEVICE=tiffg3 -r204x196 -dDITHERPPI=200 -sOutputFile=test.tif prefix.ps test.pdf
it would convert your PDF into the high res G3 TIFF. and prefix.ps code is here:
<< currentpagedevice /InputAttributes get
0 1 2 index length 1 sub {1 index exch undef } for
/InputAttributes exch dup 0 <</PageSize [0 0 612 1728]>> put
/Policies << /PageSize 3 >> >> setpagedevice
another thing about this sdk is that it's open source; you're getting both c and ps (postscript) source code for it. Also if you're going with another tool check what kind of an engine they have to power the pdf rendering, it could happen they are using gs for it; like for instance LeadTools does.
hope this helps, regards

You can use the icepdf library (Apache 2.0 License).
They even provide this exact use case as one of their example source code:
http://wiki.icesoft.org/display/PDF/Multi-page+Tiff+Capture

Maybe it is not neccessary to convert the PDF into TIFF. The fax will most likely be an embedded image in the PDF, so you could just extract these images again. That should be possible with the already mentioned iText library.
I don't know if this is easier than the other approach.

Take a look at Apache PDFBox - A Java PDF Library

No Itext can not convert PDFs to Tiff.
However, there are commercial libraries that can do that. jPDFImages is a 100% java library that can convert PDF to images in TIFF, JPEG or PNG formats (and maybe JBIG? I am not sure). It can also do the reverse, create PDF from images. It starts at $300 for a server.

Here is a good article and wrapper classes for using GhostScript with C# .NET...ended up using this in production
http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx

I have some great experience with iText (now, I'm using 5.0.6 version) and this is the code for tiff convertion into pdf:
private static String convertTiff2Pdf(String tiff) {
// target path PDF
String pdf = null;
try {
pdf = tiff.substring(0, tiff.lastIndexOf('.') + 1) + "pdf";
// New document A4 standard (LETTER)
Document document = new Document(PageSize.LETTER, 0, 0, 0, 0);
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(pdf));
int pages = 0;
document.open();
PdfContentByte cb = writer.getDirectContent();
RandomAccessFileOrArray ra = null;
int comps = 0;
ra = new RandomAccessFileOrArray(tiff);
comps = TiffImage.getNumberOfPages(ra);
// Convertion statement
for (int c = 0; c < comps; ++c) {
Image img = TiffImage.getTiffImage(ra, c + 1);
if (img != null) {
System.out.println("page " + (c + 1));
img.scalePercent(7200f / img.getDpiX(), 7200f / img.getDpiY());
document.setPageSize(new Rectangle(img.getScaledWidth(), img.getScaledHeight()));
img.setAbsolutePosition(0, 0);
cb.addImage(img);
document.newPage();
++pages;
}
}
ra.close();
document.close();
} catch (Exception e) {
logger.error("Convert fail");
logger.debug("", e);
pdf = null;
}
logger.debug("[" + tiff + "] -> [" + pdf + "] OK");
return pdf;
}

Related

How to get favicon.ico from a website using Java?

So I'm making an application to store shortcuts to all the user's favorite applications, acting kind of like a hub. I can have support for actual files and I have a .lnk parser for shortcuts. I thought it would be pretty good for the application to support Internet shortcuts, too. This is what I'm doing:
Suppose I'm trying to get Google's icon (http://www.google.com/favicon.ico).
I start out by getting rid of the extra pages (e.g. www.google.com/anotherpage would become www.google.com.
Then, I use ImageIO.read(java.net.URL) to get the Image.
The problem is that ImageIO never returns an Image when I call this method:
String trimmed = getBaseURL(page); //This removes the extra pages
Image icon = null;
try {
String fullURLString = trimmed + "/favicon.ico";
URL faviconURL = new URL(fullURLString);
icon = ImageIO.read(faviconURL);
} catch (IOException e) {
e.printStackTrace();
}
return icon;
Now I have two questions:
Does Java support the ICO format even though it is from Microsoft?
Why does ImageIO fail to read from the URL?
Thank you in advance!
Try Image4J.
As this quick Scala REPL session shows (paste-able as Java code):
> net.sf.image4j.codec.ico.ICODecoder.read(new java.net.URL("http://www.google.com/favicon.ico").openStream())
res1: java.util.List[java.awt.image.BufferedImage] = [BufferedImage#65712a80: type = 2 DirectColorModel: rmask=ff0000 gmask=ff00 bmask=ff amask=ff000000 IntegerInterleavedRaster: width = 16 height = 16 #Bands = 4 xOff = 0 yOff = 0 dataOffset[0] 0]
UPDATE
To answer your questions: Does Java support ICO? Doesn't seem like it:
> javax.imageio.ImageIO.read(new java.net.URL("http://www.google.com/favicon.ico"))
java.lang.IllegalArgumentException: Empty region!
Why does ImageIO fail to read from the URL? Well, the URL itself seems to work for me, so you may have a proxy/firewall issue, or it could be the problem above.
Old post, but for future reference:
I've written a plugin for ImageIO that adds support for .ICO (MS Windows Icon) and .CUR (MS Windows Cursor) formats.
You can get it from GitHub here: https://github.com/haraldk/TwelveMonkeys/
After you have installed the plugin, you should be able to read the icon, using the code in the original post without any modifications.
You don't need ImageIO for this. Just copy the bytes, same as for any other static resource.
There is Apache Commons Imaging for reading ico files and others: https://commons.apache.org/proper/commons-imaging/index.html
Reading an ico file works like this:
List<BufferedImage> images = org.apache.commons.imaging.Imaging.getAllBufferedImages(yourIcoFile);
In your case you have to download it first, I guess.

Wrong parsing with iText's PdfTextExtractor

I'm facing a problem when I try to read the content of a PDF document. I'm using iText 2.1.7 with Java, and I need to analyze the content of a PDF document: at first I was using the PdfTextExtractor's getTextFromPage method and it was working right, but only when the page is just text, if it contains an image, then the String that I get with the getTextFromPage is a set of meaningless symbols (maybe a different character encoding?), and I lose the content of the whole page. I tried with the last version of iText and works fine, but if I'm not wrong the license wouldn't be totally free (I'm working in a web application for a commercial customer, which serves PDFs on the fly) so I can't use it. I would really appreciate if you have any suggestion.
In case you need it, here is the code:
PdfReader pdf = new PdfReader(doc); //doc is just a byte[]
int pageCount = pdf.getNumberOfPages();
for (int i = 1; i <= pageCount; i++) {
PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(pdf);
String pageText = pdfTextExtractor.getTextFromPage(i);
Thanks in advance, regards.
I think that you PDF has an inline image. I do not think that iText 2.1.7 will deal with that.
You can find information regarding the license here

Converting a PDF into multiple JPGs with iText or other

I have the need to convert any multipage PDF file into a set of JPGs.
Since the PDF files are supposed to come from a scanner, we can assume each page just contains a graphic object to extract, but I cannot be 100% sure of that.
So, I need to convert any renderable content from each page into a single JPEG file.
How can I do this with iText?
If I can't do this with iText, what Java library can achieve this?
Thanks.
Ghostscript (available for Windows, Linux, MacOS X, Solaris, AIX,...) can convert...
...from input formats: PDF, PostScript, EPS and AI
...into output formats: JPEG, TIFF, PNG, PNM, PPM, BMP, (and more).
(The ImageMagick mentioned above doesn't do the conversion on its own -- it uses Ghostscript under the hood, as do many other tools.)
ICEpdf - http://www.icepdf.org/ - has an open source entry version which should do what you need.
I believe the primary difference between the open source version and the pay-for version is that the pay-for has much better font support.
You can also use Sun's PDF-Renderer and JPedal does PDF to image (low and high res.
With Apache PDFBox you could do the following:
PDDocument document = PDDocument.load(pdffile);
List<PDPage> pages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = pages.get(i);
BufferedImage image = page.convertToImage(BufferedImage.TYPE_INT_RGB, 72);
ImageIO.write(image, "jpg", new File(pdffile.getAbsolutePath() + "_" + i + ".jpg"));
}

Open Source libraries for PDF to image conversion [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Export PDF pages to a series of images in Java
Please suggest some good java libraries which can be used for a PDF file to image conversion.
I tried using PDFBox: http://pdfbox.apache.org/ but after conversion to image most of my text from the pdf file was garbled in the image. It read a 'T' as a 'Y' a 'C' as a '#' and so on.
Following is the code snippet I used for the same:
PDDocument document = null;
document = PDDocument.load( pdfFile );
List pages = document.getDocumentCatalog().getAllPages();
for( int i=startPage-1; i<endPage && i<pages.size(); i++ )
{
try
{
PDPage page = (PDPage)pages.get( i );
BufferedImage image = page.convertToImage();
}
}
document.close();
I guess it is some issue that they have with rendering fonts.
In case u think I might have missed something out while using PDFBox please let me know.
Please suggest any other alternatives as well.
I have tried using jPedal: http://www.jpedal.org/ which works out fine but its not free so please suggest about all good alternatives on this.
Try icePDF

With Java: replace string in MS Word file [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Improve this question
We need a Java library to replace strings in MS Word files.
Can anyone suggest?
While there is MS Word support in Apache POI, it is not very good. Loading and then saving any file with other than the most basic formatting will likely garble the layout. You should try it out though, maybe it works for you.
There are a number of commercial libraries as well, but I don't know if any of them are any better.
The crappy "solution" I had to settle for when working on a similar requirement recently was using the DOCX format, opening the ZIP container, reading the document XML, and then replacing my markers with the right texts. This does work for replacing simple bits of text without paragraphs etc.
private static final String WORD_TEMPLATE_PATH = "word/word_template.docx";
private static final String DOCUMENT_XML = "word/document.xml";
/*....*/
final Resource templateFile = new ClassPathResource(WORD_TEMPLATE_PATH);
final ZipInputStream zipIn = new ZipInputStream(templateFile.getInputStream());
final ZipOutputStream zipOut = new ZipOutputStream(output);
ZipEntry inEntry;
while ((inEntry = zipIn.getNextEntry()) != null) {
final ZipEntry outEntry = new ZipEntry(inEntry.getName());
zipOut.putNextEntry(outEntry);
if (inEntry.getName().equals(DOCUMENT_XML)) {
final String contentIn = IOUtils.toString(zipIn, UTF_8);
final String outContent = this.processContent(new StringReader(contentIn));
IOUtils.write(outContent, zipOut, UTF_8);
} else {
IOUtils.copy(zipIn, zipOut);
}
zipOut.closeEntry();
}
zipIn.close();
zipOut.finish();
I'm not proud of it, but it works.
I would suggest the Apache POI library:
http://poi.apache.org/
Looking more - it looks like it hasn't been kept up to date - Boo! It may be complete enough now to do what you need however.
Try this one: http://www.dancrintea.ro/doc-to-pdf/
Besides replacing strings in ms word files can also:
- read/write Excel files using simplified API like: getCell(x,y) and setCell(x,y,string)
- hide Excel sheets(secondary calculations for example)
- replace images in DOC, ODT and SXW files
- and convert:
doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf
I would take a look at the Apache POI project. This is what I have used to interact with MS documents in the past.
http://poi.apache.org/
Thanks all. I am gonna try http://www.dancrintea.ro/doc-to-pdf/
because I need to convert classic DOC file(binary) and not DOCX(zip format).

Categories