What is the best way to generate a PCL output file from an existing PDF file in Java?
It depends on how much you want to invest, and how robust the solution needs to be. For quick and dirty, you can print from Adobe Acrobat to a file, using a PCL driver (look, mom, no Java ...).
The Java Print Service API can process PDF. Use StreamPrintService and write the stream to a file, using PCL for the output format.
If you need to have more control over the content, maybe modify it or add to it, you can use a PDF parser (this one, for instance) and print the resulting HTML from a browser that your application starts, by adding some Javascript, for example.
The StreamPrintService in JDK 6 only supports PS (PostScript) output. I am still searching for a StreamPrintService that supports PCL.
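One way to check what a particular JDK offers is to ask StreamPrintServiceFactory which output formats it can produce. The following is a minimal sketch, not a conversion tool: the class and method names (StreamPrintFormats, factoriesFor) are made up for illustration, and the result depends entirely on the print service providers installed in your runtime, so the PCL list may well be empty.

```java
import javax.print.DocFlavor;
import javax.print.StreamPrintServiceFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the stream print service factories the
// running JDK provides for a given input flavor and output MIME type.
class StreamPrintFormats {

    // flavor may be null, meaning "any input flavor".
    static List<String> factoriesFor(DocFlavor flavor, String outputMimeType) {
        StreamPrintServiceFactory[] factories =
                StreamPrintServiceFactory.lookupStreamPrintServiceFactories(flavor, outputMimeType);
        List<String> descriptions = new ArrayList<>();
        for (StreamPrintServiceFactory factory : factories) {
            descriptions.add(factory.getClass().getName() + " -> " + factory.getOutputFormat());
        }
        return descriptions;
    }

    public static void main(String[] args) {
        // "application/vnd.hp-PCL" is the MIME type used by DocFlavor.INPUT_STREAM.PCL.
        System.out.println("PDF to PCL: "
                + factoriesFor(DocFlavor.INPUT_STREAM.PDF, "application/vnd.hp-PCL"));
        System.out.println("Anything to PostScript: "
                + factoriesFor(null, "application/postscript"));
    }
}
```

If the first list comes back empty, the JDK cannot do the PDF-to-PCL step on its own and one of the external approaches in this thread is needed.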
We capture PCL generated from Acrobat printing a PDF to a PCL driver and redirect as input to our Windows console PCLXForm program. With a custom script, we can "stream edit" the PCL. We can extract the address block text for address correction, insert the corrected text, add the Intelligent Mail Barcode, 2-D barcodes, sort the documents, batch them by page count, change tray assignments, merge with other documents, etc. The product required is PCLTool SDK - Option V at www.pagetech.com
We currently use the PhantomJS executable for two things in our Java project:
1. Create a PDF file from a given String html we get from our database (for which we write the String to a temp file first)
2. Create a screenshot of a given Widget-Object (for which we have an open HTML page on the front-end)
Since PhantomJS hasn't been updated for a few years, I'm about to change it to a headless Chromium method instead, which has the options --print-to-pdf and --screenshot for options 1 and 2.
Option 2 isn't really relevant since we have a page, but for option 1 it would be nice if we could directly use the chromium command-line with the given String. Right now with PhantomJS, we convert the String to a temp file, and then use the executable to generate the actual PDF output file. I can of course do the same with the headless Chromium executable, but since I'm changing it right now anyway, it would be nice if the 'String to temp HTML file' step wouldn't be necessary for creating the output PDF file, since we already have the page in memory anyway after retrieving it from the database.
From what I've seen, the Chromium executable is usually run for either an HTML file to PDF file:
chromium --headless --disable-gpu --print-to-pdf="C:/path/to/output-file.pdf" C:/path/to/input-file.html
Or for an HTML page to PDF file:
chromium --headless --disable-gpu --print-to-pdf="C:/path/to/output-file.pdf" https://www.google.com/
I couldn't really find the docs for the chrome/chromium executable (although I have been able to find the list of command options in the source code), so maybe there are more options besides these two above? (If anyone has a link to the docs, that would be great as well.)
If not, I guess I'll just use a temp file as we did before with PhantomJS.
The terms 'chrome read stdin' would probably have brought you to this question explaining how to read from a data url:
chrome.exe "data:text/html;base64,PCFET0NUWVBFIGh0bWw+PGh0bWw+PGhlYWQ+PHRpdGxlPlRlc3Q8L3RpdGxlPjwvaGVhZD48Ym9keT5ZbzwvYm9keT48L2h0bWw+"
Reading input from stdin suggests you would also want to write the output to stdout. Searching for 'chrome pdf to stdout' leads to someone trying the same thing in 2018 and running into problems because --stdout could not be combined with screenshot or PDF output.
And (depending on the use case) even worse, data URLs are limited to 2MB.
So if you can't guarantee the input to be less than 2MB, you might be better off using files anyway, or you could check whether the limitation has been removed.
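To make the data URL route concrete, here is a small Java sketch that turns an in-memory HTML String into a base64 data URL and refuses to do so when the result exceeds a caller-supplied limit. The class and method names are invented for illustration, and the limit is a parameter because the 2MB figure comes from the linked discussion and may not hold for your Chromium version.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Optional;

// Hypothetical helper for passing in-memory HTML to a browser as a data URL.
class DataUrl {

    static final String PREFIX = "data:text/html;base64,";

    // Encodes the HTML as UTF-8 bytes, then base64, and prepends the data URL prefix.
    static String toDataUrl(String html) {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(bytes);
    }

    // Returns the data URL only if it fits within maxUrlLength characters;
    // otherwise the caller should fall back to a temp file.
    static Optional<String> toDataUrlIfSmallEnough(String html, int maxUrlLength) {
        String url = toDataUrl(html);
        return url.length() <= maxUrlLength ? Optional.of(url) : Optional.empty();
    }
}
```

Base64 expands the input by roughly a third, so the usable HTML size is noticeably smaller than whatever URL limit applies.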
Also, given that you specify that option 2 has a solution in serving the page directly, would that not also open up the option to do the same for option 1?
You should not need --disable-gpu. I'm unsure from which version it stopped being needed on Windows, but it was made redundant in https://chromium-review.googlesource.com/c/chromium/src/+/1161172 (2018). However, you may want to replace it with --print-to-pdf-no-header to avoid the page headers.
You're using Windows as the shell to run Chrome/MSEdge.exe, so the command line (CMD) has significantly less capacity for passing a variable string.
Passing the HTML as a base64 string will often be limited, for similar string-length reasons, to about 1.5MB of content (75% of 2MB). In exceptional cases that may be 4096 pages (see https://github.com/GitHubRulesOK/MyNotes/raw/master/Hanoi.htm); however, the norm is usually only a few standard HTML pages.
PDF file handling requires a file system to generate the pages, hence a file-centred approach to store the decimal-based file index. So the in-memory workaround is to use a RAM drive/disk, or its bytes-IO equivalent as a named FileStream object.
Working with PDF data in memory is usually highly disk-intensive, because once the program has processed the contents, the limited remaining resources need to draw on the disk cache to augment virtual RAM. As a result, working in memory can be just as slow as, if not slower than, using cached disk file data.
%Tmp% / %temp% files can usually respond quicker and be very easily overwritten.
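For the temp-file route from Java, a sketch along these lines may be enough. It assumes a Chromium or Chrome executable whose path you supply yourself; the class and method names are hypothetical, and only the --headless and --print-to-pdf switches quoted in the question are used.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper: HTML String -> temp file -> headless Chromium -> PDF.
class HeadlessPdf {

    // Builds the argument list; ProcessBuilder passes each element as one
    // argument, so no shell quoting of the paths is needed.
    static List<String> buildCommand(String chromiumExe, Path inputHtml, Path outputPdf) {
        return List.of(
                chromiumExe,
                "--headless",
                "--print-to-pdf=" + outputPdf.toAbsolutePath(),
                inputHtml.toUri().toString());
    }

    // Writes the HTML to a temp file, runs Chromium, and removes the temp file again.
    static Path htmlToPdf(String chromiumExe, String html, Path outputPdf)
            throws IOException, InterruptedException {
        Path tempHtml = Files.createTempFile("page-", ".html");
        try {
            Files.writeString(tempHtml, html, StandardCharsets.UTF_8);
            Process process = new ProcessBuilder(buildCommand(chromiumExe, tempHtml, outputPdf))
                    .inheritIO()
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("Chromium exited with code " + exitCode);
            }
            return outputPdf;
        } finally {
            Files.deleteIfExists(tempHtml);
        }
    }
}
```

Building the command as a list rather than one string sidesteps the CMD quoting and length concerns for everything except the HTML itself, which stays in the file.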
There are many other working and non-working switches bandied about the web, but the semi-official list is https://peter.sh/experiments/chromium-command-line-switches/
I'm planning to put some Java code in an appendix to my report. The report is a PDF document, and I use Eclipse for Java.
How can I present it best and do this easily? Any recommendations?
For this purpose, I created a LaTeX doclet. This is a Javadoc doclet, which converts the javadoc comments to LaTeX code, and (if wanted) also includes a pretty-printed version of the source code of the documented methods.
You can then convert the generated LaTeX document to PDF, and append it to your report.
If you use Windows, install CutePDF. This adds a "printer" that, when you print to it, asks for a file name and then writes the output to a .pdf document on your hard drive. It is a pseudo printer: it acts like a printer, but is really a PDF file writer.
I don't know of solutions for other operating systems...
I usually prefer to install a PDF "pseudo" printer in whatever OS I am using. That way I can use the print facilities of whatever app I am using (Eclipse, for example) and get the result as a PDF file.
EDIT:
Here is one example of a pseudo printer, this for the Windows platform. Mac OS X has a built in "print to PDF file" capability.
You can use doxygen to generate documentation for your project which can include a formatted source file listing in addition to Javadoc. doxygen can generate both HTML and PDF output. You'll need LaTeX to generate the PDF output.
Another way to pretty print is with IntelliJ IDEA. It also works with the community edition.
It's advisable to install a PDF printer, in order to try printouts without wasting a lot of paper. Once you're satisfied with the result, you can print on the real printer. On Windows you can use CutePDF, on Linux Ubuntu install the package cups-pdf with sudo apt-get install cups-pdf.
Note that IntelliJ prints the theme's background, so it's advisable to be on a white background to avoid wasting ink.
To print click on menu File -> Print. The printer selection is in the next menu, after you press on the Print button.
Interestingly you can also print only the selected text, which is useful if you don't want to print import statements.
Other options include the possibility to add line numbers, syntax highlighting and colour printing. On Linux IntelliJ 14.0.3, the default font was a huge size 14, so you might want to change that too.
You could just copy & paste into Word (2007+) and save as PDF. It's a little more straightforward than the file printer, and you can format your code for best results in Word.
You could just copy & paste into OpenOffice/LibreOffice and export to PDF.
After clicking a JButton, I want a printer window to pop up directly to print the PDF file, without needing to show the file. Is this possible?
There are multiple ways to do it:
You can access the installed printers; this depends on how the printers are configured, and you then need some print plugin to write to a PDF file.
Otherwise, you can use one of the many Java PDF libraries available to do the PDF creation part too.
See Desktop.print(File).
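A minimal sketch of wiring Desktop.print(File) to a JButton could look like this. The class and method names are illustrative only. Desktop.print hands the file to the application the operating system associates with PDFs, so whether a print dialog appears, and whether the viewer briefly opens, depends on that application rather than on this code.

```java
import java.awt.Desktop;
import java.io.File;
import java.io.IOException;
import javax.swing.JButton;

// Hypothetical helper that prints a PDF via the desktop's associated application.
class PdfPrintAction {

    // Returns true if the print request was handed to the desktop,
    // false if the file is missing or printing is not supported here.
    static boolean print(File pdf) {
        if (!pdf.isFile()) {
            return false;
        }
        if (!Desktop.isDesktopSupported()) {
            return false;
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.PRINT)) {
            return false;
        }
        try {
            desktop.print(pdf);
            return true;
        } catch (IOException e) {
            // No associated application, or it failed to launch.
            return false;
        }
    }

    // Creates a button that prints the given file when clicked.
    static JButton createPrintButton(File pdf) {
        JButton button = new JButton("Print");
        button.addActionListener(event -> print(pdf));
        return button;
    }
}
```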
How can I read the content of a scanned PDF file in Java/JSP or in JavaScript? Can you tell me how to achieve this in code?
Thanks in advance for any reply.
You can convert the scanned PDF to an image using GhostScript and then feed it to an OCR engine, such as Tesseract. Take a look at VietOCR for an example implementation.
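As a rough sketch of that pipeline driven from Java, the two command-line tools can be called through ProcessBuilder. This assumes the gs and tesseract executables are installed and on the PATH (on Windows the Ghostscript console binary has a different name, for example gswin64c); the class name, method names, and file naming pattern are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical driver: Ghostscript renders pages to PNG, Tesseract reads them.
class ScannedPdfOcr {

    // Renders every page of the PDF to outputDir/page-001.png, page-002.png, ...
    static List<String> ghostscriptCommand(Path pdf, Path outputDir, int dpi) {
        return List.of(
                "gs",
                "-dNOPAUSE",
                "-dBATCH",
                "-sDEVICE=png16m",
                "-r" + dpi,
                "-sOutputFile=" + outputDir.resolve("page-%03d.png"),
                pdf.toString());
    }

    // Tesseract writes the recognised text to outputBase + ".txt".
    static List<String> tesseractCommand(Path image, Path outputBase, String language) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "-l", language);
    }

    // Runs one external command and fails loudly on a non-zero exit code.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command.get(0) + " exited with code " + exitCode);
        }
    }
}
```

A resolution of around 300 dpi is a common starting point for OCR on scanned text, but the best value depends on the scan quality.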
What you are trying to do (I think) is use OCR to extract text from an image PDF produced by a scanner. Java is probably the best choice for this. There are a number of options, depending on whether you are prepared to pay for software. Google for Java (or JavaScript), PDF and OCR.
IMO, this task is not something that should be done in a JSP. JSPs are best for rendering results ... not for generating them in the first place.
Actually, I am working on the same kind of project at the moment. I do it in the following steps, and the result works well.
The user uploads a scanned PDF to the PDFUploader servlet, which returns a server-side file name to the front end, indicating that the upload was successful.
The front end uses this file name and the default page 0 to ask the PDFReader servlet to retrieve the first page of the PDF file and display it at the front end. You can convert the PDF to an image, or use an iframe to get the embedded PDF reader.
The front end uses this file name and the default page 0 to ask the OCRServlet to perform OCR. I am using WeOCR and Tesseract as my OCR engine behind an Apache HTTP server. I have modified some parts of submit.cgi in the WeOCR server, since I know which formats the WeOCR server will receive. I still have some problems converting the scanned PDF to an image (I am using PDFBox).
Google for anything OCR related,
best bet will be to use existing libraries like http://asprise.com/product/ocr/index.php?lang=java
Is there any free Java library for extracting text from PDF that is compatible with Google App Engine?
I've read about PDFJet, but it can't read PDF, can it?
Is there perhaps another way to extract text from PDF? I tried http://www.pdfdownload.org/; unfortunately, they don't handle non-English characters correctly.
iText now has a text parsing module (I'm one of the parser authors). See the com.itextpdf.text.pdf.parser.PdfContentReaderTool class for an example of how to use it.
PdfBox does not run on GAE. It uses Java classes that are not allowed.
(GAE only permits these http://code.google.com/appengine/docs/java/jrewhitelist.html)
I have partially modified a very old version of PdfBox (0.7.3) to be GAE compliant. Now I'm able to extract text from PDF (whole page or rectangular area). I only modified a minimum part of the PDF text extraction and not the whole PdfBox. :)
The idea was to remove references to java.awt.Rectangle etc., using my own "rectangle" class.
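To illustrate the idea (this is not the class from the modified PdfBox, just a guess at the general shape), a replacement can be a plain value class with no java.awt dependency that offers only the operation region-based extraction needs: checking whether a text position falls inside the area. The name PlainRectangle and the edge convention are assumptions.

```java
// Hypothetical stand-in for java.awt.Rectangle with no AWT dependency.
class PlainRectangle {

    final float x;
    final float y;
    final float width;
    final float height;

    PlainRectangle(float x, float y, float width, float height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // True if the point lies inside the rectangle. The left and top edges
    // are inclusive, the right and bottom edges exclusive.
    boolean contains(float px, float py) {
        return px >= x && px < x + width
                && py >= y && py < y + height;
    }
}
```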
More info: http://fhtino.blogspot.com/2010/04/pdfbox-text-extration-gae.html
I modified the latest (1.8.0-Snapshot) version to run on Google AppEngine. Had to disable one Unit-Test, but it runs fine for simple text extraction.
Following the simple try-fail-fix approach, I had to modify 5 files in total. Pretty doable.
You'll also have to explicitly use a RandomAccessBuffer, like Fabrizio explained.
For the extra lazy, here's the compiled jar, the dependencies for text extraction, and the patch. Note that it might not work for every use case (e.g. rectangle-based extraction). I used it to extract the text of a whole page.
https://docs.google.com/folder/d/0B53n_gP2oU6iVjhOOVBNZHk0a0E/edit
I know there is http://pdfbox.apache.org/index.html
"Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents."
but I've never tested it.
Last month I finished extracting text from PDF files in my project. I used the XPDF tool to get the text and the text coordinates, but I used it in Xcode (Objective-C). The tool is open source, written in C++, and can be used from many languages. However, I don't know whether XPDF will work with your Java code. Anyway, you can try this tool.