I have a program that outputs to PDF, but I also want it to be able to read its own output back.
I have come up with my own data type, which my program is able to read, but I need it included in the PDF file somehow (no multiple files; I want one file per output).
I also need this data to be invisible and undetectable to the user.
I heard something about PDF dictionaries, but I'm not sure how to use them (or if there's another way). I do not want to use an XMP/XML file; my data is more complex than key-value.
What would be nice is somebody writing me a couple of example lines of code that would enable me to:
add a new dictionary to a PDF using iText
populate it with data using iText
locate it in a file using iText
read from it using iText
You want to do something similar to what Adobe Illustrator is doing. If you create a PDF from Adobe Illustrator, you can encapsulate the original AI file. This gives you the impression the PDF can be edited. In reality, Adobe Illustrator takes the AI file and uses that to edit, and re-creates the PDF from the updated AI.
Where is this information stored? See ISO-32000-1 section 14.5:
Conforming products may use this dictionary as a place to store
private data in connection with that document, page, or form. Such
private data can convey information meaningful to the conforming
product that produces it (such as information on object grouping for a
graphics editor or the layer information used by Adobe Photoshop®) but
may be ignored by general-purpose conforming readers.
I'm not sure what is being asked here. If you're asking for advice, it is what I answered above: for instance, add a PieceInfo entry to the Root dictionary (aka the Catalog). This is all documented, isn't it? Read the ISO specification, and read part 4 of "iText in Action".
If your question is "write some code for me that does what I need to do", then I believe that's more or less in violation of the goal of this site.
Well, you could hex-encode your data as a String and then draw it off-screen like this:
cb.showTextAligned(PdfContentByte.ALIGN_LEFT,"HIDDENDATA_"+ hexencodeddata, 2000f,2000f, 0f);
and, to read it back, process all strings in the document, searching for HIDDENDATA_
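The encode and search steps of that idea can be sketched with the standard library alone. This is only an illustration: the class name HiddenDataCodec and its methods are made up for this example, and it assumes the strings have already been extracted from the PDF by whatever library you use (that part is not shown).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds and recognises "HIDDENDATA_<hex>" strings.
final class HiddenDataCodec {
    static final String MARKER = "HIDDENDATA_";

    /** Hex-encodes the payload and prefixes it with the marker. */
    static String encode(byte[] payload) {
        StringBuilder sb = new StringBuilder(MARKER);
        for (byte b : payload) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Decodes every extracted string that starts with the marker and carries valid hex. */
    static List<byte[]> decodeAll(List<String> extractedStrings) {
        List<byte[]> result = new ArrayList<>();
        for (String s : extractedStrings) {
            if (!s.startsWith(MARKER)) {
                continue; // ordinary visible text
            }
            String hex = s.substring(MARKER.length());
            if (!hex.matches("(?:[0-9a-fA-F]{2})*")) {
                continue; // damaged or truncated payload, skip it
            }
            byte[] out = new byte[hex.length() / 2];
            for (int i = 0; i < out.length; i++) {
                out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
            }
            result.add(out);
        }
        return result;
    }

    public static void main(String[] args) {
        String hidden = encode("q1=42".getBytes(StandardCharsets.UTF_8));
        // In a real program the list would come from text extraction.
        for (byte[] data : decodeAll(Arrays.asList("Some visible text", hidden))) {
            System.out.println(new String(data, StandardCharsets.UTF_8));
        }
    }
}
```

Hex keeps the payload within plain ASCII, so it should survive being written as PDF text, at the cost of doubling its size.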
Another way is to use Annotations
public void addAnnotation(PdfWriter writer,
        Document document, Rectangle rect, String text) {
    PdfAnnotation annotation = new PdfAnnotation(writer,
        new Rectangle(
            rect.getRight() + 10, rect.getBottom(),
            rect.getRight() + 30, rect.getTop()));
    annotation.setTitle("Text annotation");
    annotation.put(PdfName.SUBTYPE, PdfName.TEXT);
    annotation.put(PdfName.OPEN, PdfBoolean.PDFFALSE);
    annotation.put(PdfName.NAME, new PdfName(text));
    writer.addAnnotation(annotation);
}
And then use something like this to read it:
http://downloads.snowtide.com/javadoc/PDFTextStream/2.3.2/com/snowtide/pdf/PDFTextStream.html
Related
I have a template PDF for an agenda, and I want to know how to detect where the date should go.
Let's say the template contains the word “DATE:”.
I want to detect “DATE:” and add the corresponding date next to it, so that after writing it looks something like “DATE: 13/02/2020”, and then save the result as a new PDF.
You tagged your question both java and python-3.x. That makes it very broad. My answer, therefore, also is generic, not specific. In general you should decide which language you ask for.
For your task you will need to do two things:
first, apply text extraction with coordinates to your pdf, search for that DATE marker in the text, and determine the coordinates right after that piece of text (some libraries offer a shortcut: routines that extract only text matching a regular expression, together with its coordinates);
then add text to your content at those coordinates.
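To make the first step a bit more concrete, here is a library-independent sketch in Java. Everything in it is hypothetical: the TextChunk class stands in for whatever "text plus coordinates" type your chosen pdf library returns, and the end of the marker is estimated by assuming all characters in a chunk are equally wide, which real fonts rarely are. A real library would normally give you exact glyph positions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: locate the point right after a marker such as "DATE:".
final class MarkerLocator {

    /** Stand-in for a library's extracted text piece (baseline start x/y and width in PDF units). */
    static final class TextChunk {
        final String text;
        final float x;
        final float y;
        final float width;

        TextChunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Returns {x, y} just right of the first occurrence of a non-empty marker,
     * or empty if no chunk contains it. The x value is only an estimate.
     */
    static Optional<float[]> positionAfter(List<TextChunk> chunks, String marker, float gap) {
        for (TextChunk chunk : chunks) {
            int index = chunk.text.indexOf(marker);
            if (index < 0) {
                continue;
            }
            // Assumption: uniform character width inside the chunk.
            float perChar = chunk.width / chunk.text.length();
            float markerEndX = chunk.x + perChar * (index + marker.length());
            return Optional.of(new float[] { markerEndX + gap, chunk.y });
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("AGENDA", 72f, 750f, 60f),
                new TextChunk("DATE:", 72f, 700f, 30f));
        positionAfter(chunks, "DATE:", 4f)
                .ifPresent(p -> System.out.println("write date at x=" + p[0] + ", y=" + p[1]));
    }
}
```

The second step, drawing the date at the returned coordinates, depends entirely on the library you pick and is not shown.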
Neither Java nor Python has explicit pdf support in its core. Thus, you'll have to choose a pdf library for those tasks. (Theoretically you could try to implement your own pdf processing routines, but the pdf format is quite complex, so in general that would take very long.)
So you should first check which general-purpose pdf library for your chosen language appears most appropriate for those tasks and your other requirements (like licensing). There are many questions and answers on Stack Overflow concerning text extraction which may help you choose.
Some words of warning, though: not all pdfs allow proper text extraction. There are pdf generators which don't add the information required for text extraction to pdfs; some actually even add misleading information. Thus, you might have to reject some templates. Alternatively, if the template is fixed, simply determine the correct coordinates for text insertion by measuring in a pdf viewer or by trial and error.
And if you still have influence on the requirements, propose to use templates with pdf AcroForm form fields. Form field fill-in allows more control for the template designer concerning the positioning and styling of the fill-ins, and fill-in is easier than the process outlined above. If you don't want form fields in the result pdfs, simply flatten the forms after fill-in.
I'm developing a Java web app that calculates a user's IQ. I want the app to offer a "Get Your Certificate" option at the end: a PDF file (a certificate of appreciation) should be generated automatically with the user's pre-entered name and IQ score.
How can one achieve this? I've already seen this type of feature on some websites that provide certifications.
Java PDF APIs
Here is an answer to a similar question referencing a few well-known APIs.
Here is a more recent article detailing the licenses for those APIs.
Yet another listing of resources.
Flow of control
User clicks a link that generates a request that will be handled by the servlet.
Extract whatever you need from the URL within the servlet.
Use your chosen API to build the content for the PDF using a writer.
Push the PDF to the client.
Take a look at some iText samples. You can fill out a form, then click "flatten" and you have a PDF containing the data you used. As you're talking about a certificate, the easiest solution would be to create a PDF template using AcroForm technology. For instance: state.pdf is the interactive PDF that was used in the example I just mentioned.
The code used to fill out and flatten this form can be found here. For more examples, please read Chapter 6 of my book "iText in Action" (that chapter is available for free; you need section 6.3.5). I've also written a complete chapter about integrating code like this in a web application. You can find the examples that come with this chapter here.
Basically, you need to do something like this:
// Open the template and create a stamper that writes the filled-in copy to dest
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader,
    new FileOutputStream(dest));
// Fill in the form fields by name
AcroFields fields = stamper.getAcroFields();
fields.setField("name", "CALIFORNIA");
fields.setField("abbr", "CA");
fields.setField("capital", "Sacramento");
fields.setField("city", "Los Angeles");
fields.setField("population", "36,961,664");
fields.setField("surface", "163,707");
fields.setField("timezone1", "PT (UTC-8)");
fields.setField("timezone2", "-");
fields.setField("dst", "YES");
// Flatten the form so the values become regular, non-editable page content
stamper.setFormFlattening(true);
stamper.close();
reader.close();
Caveat regarding the data that is entered: The simple example uses a very basic font that doesn't know how to display special characters. If you need characters such as ñ, é, à, etc., you'll need to introduce a font with more glyphs.
Caveat regarding the jsp-tag you used: I have written this helloworld.jsp that results in this PDF, which proves that it is possible to generate PDF from JSP. Nevertheless, it is a bad idea to do so. When you learned how to write JSP, your teacher probably told you that JSP shouldn't be used to create binary files. (If he didn't tell you this, he either forgot or he wasn't a good teacher.) As there are so many pitfalls when using JSP to create binary files, and as a JSP file is eventually compiled to a Servlet anyway, you should forget about creating a JSP to create a PDF and prefer writing a Servlet. It will save you plenty of time and your code will be easier to maintain (the slightest change to your JSP file can break the code).
I have a program which will be used for building a question database. I'm making it for a site that wants users to know that the content was downloaded from that site. That's why I want the output to be PDF: almost everyone can view it, and almost nobody can edit it (and remove e.g. a footer or watermark, unlike in some simpler file types). That explains why it HAS to be PDF.
This program will be used by numerous users who will create new databases or expand existing ones. That's why splitting the output into multiple files would be an extremely sloppy and inefficient way of achieving what I want (it would complicate things for the user).
What I want to do is create PDF files which are still editable with my program once created.
I want to achieve this by embedding my custom file type, readable by my program, in the output PDF.
I came up with three ways of doing that:
Attach the file to the PDF and then corrupt the part of the PDF which contains it, in such a way that the PDF is unaware that it contains the file, thus making it impossible for the user to notice it (easily). Upon reading the document I'd revert the corruption and extract the file using one of the many PDF libraries.
Hide the file inside an image which would be added to the PDF somewhere on the first or last page, somehow (that I still need to work out) hidden from the public eye. Knowing its location, it should be relatively easy to retrieve it using a PDF library.
I have learned that if you add a "%" sign as the first character of a line inside a PDF, the whole line will be ignored (similar to "//" in Java) by the PDF reader (at least Adobe Reader), making it possible for me to add as many lines as I want to the PDF (if I know where, and I do) without the end user being aware of that. I could embed my whole custom file in the PDF that way. The problem here is that I actually have to read the PDF using one of Java's input readers, but I'm not sure which one. I understand that a PDF can't be read like a text file since it's a binary file (right?).
In the end, I decided to go with the method number 3.
Unless someone has any better ideas, and the conditions are:
1. One file only. And that file is PDF.
2. User must not be aware of the addition.
The problem is that I don't know how to read the PDF as a file (I'm not trying to read it as a PDF, which I would do using a PDF library).
So, does anyone have a better idea?
If not, how do I read a PDF as a FILE, so that the output is an array of characters (with newline detection), and then rewrite the whole file with my added content?
In Java, there is no real difference between text and binary files; you can read both as an InputStream. The difference is that for binary files you can't really create a Reader, because that assumes there's a way to convert the byte stream to Unicode characters, and that won't work for PDF files.
So in your case, you'd need to read the files in byte buffers and possibly loop over them to scan for bytes representing the '%' and end-of-line character in PDF.
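A minimal sketch of such a byte-level scan is given below. The class name CommentScanner and the "%MYAPP:" prefix are made up for this example. A distinctive prefix is used instead of a bare '%' because a PDF normally contains other comment lines already (the header and %%EOF, for instance), and because binary stream data can contain '%' bytes by coincidence.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds lines that start with a given prefix in raw PDF bytes.
final class CommentScanner {

    /** Returns the remainder of every line that starts with the prefix. CR and LF both end a line. */
    static List<String> findComments(byte[] pdf, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        List<String> found = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i <= pdf.length; i++) {
            boolean atEnd = (i == pdf.length);
            if (atEnd || pdf[i] == '\n' || pdf[i] == '\r') {
                if (startsWith(pdf, lineStart, i, p)) {
                    int from = lineStart + p.length;
                    // ISO-8859-1 maps each byte to exactly one char, so nothing is lost
                    found.add(new String(pdf, from, i - from, StandardCharsets.ISO_8859_1));
                }
                lineStart = i + 1;
            }
        }
        return found;
    }

    private static boolean startsWith(byte[] data, int from, int to, byte[] p) {
        if (to - from < p.length) {
            return false;
        }
        for (int k = 0; k < p.length; k++) {
            if (data[from + k] != p[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        for (String comment : findComments(pdf, "%MYAPP:")) {
            System.out.println(comment);
        }
    }
}
```

Reading the whole file into memory keeps the sketch short; for very large files a buffered, chunked read would be the more careful choice.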
A better way is to use another existing way of encoding data in a PDF: XMP tags. This allows any sort of complex key-value pairs to be encoded in XML and embedded in PDFs, JPEGs etc. See http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf.
There's an open source library in Java that allows you to manipulate that: http://pdfbox.apache.org/userguide/metadata.html. See also a related question from someone who succeeded with it: custom schema to XMP metadata or http://plindenbaum.blogspot.co.uk/2010/07/pdfbox-insertextract-metadata-frominto.html
It's all just 1's and 0's - just use RandomAccessFile and start reading. The PDF specification defines what a valid newline character(s) is/are (there are several). Grab a hex editor and open a PDF and you can at least start getting a feel for things. Be careful of where you insert your lines though - you'll need to add them towards the end of the file where they won't screw up the xref table offsets to the obj entries.
Here's a related question that may be of interest: PDF parsing file trailer
I would suggest putting your comment immediately before the startxref line. If you put it anywhere else, you could wind up shifting things around and breaking the xref table pointers.
So a simple algorithm for inserting your special comment will be:
Go to the end of the file
Search backwards for startxref
Insert your special comment immediately before startxref - be sure to insert a newline character at the end of your special comment
Save the PDF
You can (and should) do this manually in a hex editor.
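Once the manual experiment works, the same steps can be sketched in Java with plain byte arrays. The class and method names here are made up for this example, and it assumes a classic PDF whose last startxref keyword sits at the start of its own line. Because the comment goes in after the xref table and the trailer, the offset stored after startxref should still point at the right place.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: inserts a "%..." comment line right before the last startxref.
final class StartxrefInserter {

    static byte[] insertBeforeStartxref(byte[] pdf, String commentBody) {
        if (commentBody.indexOf('\n') >= 0 || commentBody.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("comment must be a single line");
        }
        byte[] keyword = "startxref".getBytes(StandardCharsets.US_ASCII);
        int pos = lastIndexOf(pdf, keyword); // search backwards from the end of the file
        if (pos < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }
        // Keep the payload ASCII (e.g. hex); the trailing newline ends the comment
        byte[] line = ("%" + commentBody + "\n").getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream(pdf.length + line.length);
        out.write(pdf, 0, pos);
        out.write(line, 0, line.length);
        out.write(pdf, pos, pdf.length - pos);
        return out.toByteArray();
    }

    static int lastIndexOf(byte[] data, byte[] needle) {
        for (int i = data.length - needle.length; i >= 0; i--) {
            int k = 0;
            while (k < needle.length && data[i + k] == needle[k]) {
                k++;
            }
            if (k == needle.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A tiny fake file tail, just to show where the comment ends up
        byte[] tail = "trailer\n<<>>\nstartxref\n9\n%%EOF\n".getBytes(StandardCharsets.US_ASCII);
        byte[] patched = insertBeforeStartxref(tail, "MYAPP:6162");
        System.out.print(new String(patched, StandardCharsets.US_ASCII));
    }
}
```

Write the returned bytes to a new file rather than over the original until you have checked that several viewers still open the result.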
Really important: are your users going to be saving changes to these files? i.e. if they fill in the form field, are they going to hit save? If they are, your comment lines may be removed during the save (and different versions of different PDF viewers could behave differently in this regard).
XMP tags are the correct way to do what you are trying to do - you can embed entire XML segments, and I think you'd be hard pressed to come up with a data structure that couldn't be expressed as XML.
I personally recommend using iText for this, but I'm biased (I'm one of the devs). The iText In Action book has an excellent chapter on embedding XMP data into PDFs. Here's some sample code from the book (which I definitely recommend): http://itextpdf.com/examples/iia.php?id=217
I have created a program that should one day become a PDF editor.
Its purpose will be saving the GUI's textual content to a PDF and loading it back from it. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page, it can have many more; also, the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found out that iText could suit my PDF-creating needs; however, if I understood it correctly, it retrieves the text of a whole page as a single string, which won't work for me, because I need to detect different fields/paragraphs/or something to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend one that fits my needs.
Or maybe to recommend how to add invisible parts to a PDF which would help my program determine where exactly each field is situated inside the PDF file...
Just to be clear (I phrased my question badly before): the only thing I need to put in my PDF is text, and that's all I need to get out later. My program should be able to read PDFs which it created itself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner that they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
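As a rough illustration of the XML part only, the sketch below turns a list of field texts into an XML document and reads it back, using the XML APIs that ship with the JDK. The class name FieldXml and the element names "fields" and "field" are invented for this example. Attaching the resulting bytes to the PDF and extracting them again depends on your PDF library and is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: stores the text of each GUI field as one <field> element.
final class FieldXml {

    /** Serializes the field texts to XML bytes, ready to be attached to the PDF. */
    static byte[] toXml(List<String> fields) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("fields");
            doc.appendChild(root);
            for (String text : fields) {
                Element field = doc.createElement("field");
                field.setTextContent(text); // special characters are escaped on output
                root.appendChild(field);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("could not write XML", e);
        }
    }

    /** Parses XML bytes produced by toXml back into the list of field texts. */
    static List<String> fromXml(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            NodeList nodes = doc.getElementsByTagName("field");
            List<String> fields = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                fields.add(nodes.item(i).getTextContent());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not read XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = toXml(Arrays.asList("Question 1: is 2 < 3?", "Answer: yes"));
        System.out.println(fromXml(xml));
    }
}
```

Keeping one element per GUI field means the program can restore each JTextArea separately instead of having to split one long page string.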
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.
In a current project I need to display PDFs in a web page. Right now we are embedding them with the Adobe PDF Reader, but I would rather have something more elegant (the reader does not integrate well, it cannot be overlaid with transparent regions, ...).
I envision something close to Google Documents, where they display PDFs as images but also allow text to be selected and copied out of the PDF (a requirement we have).
Does anybody know how they do this? Or of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on server side, but this would not allow for the selection of text ...
Thanks in advance for any help
PS: Java based project, using wicket.
I have some suggestions, but it'll be definitely hard to implement this stuff. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into images. Store these images on your server or use a caching technique. Converting a PDF into an image is not hard.
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay textual data on your image. This overlaid text is selectable, while the text the user sees is still part of the original image. To get the original text, see the next approach, or do it yourself (see the conclusion).
The original text then must be overlaid on the image, which is a pain.
Second approach:
The PDF specification allows textual information to be linked to a font. Most documents use a subset of Type 3 or Type 1 fonts, which (often) use a standard character set (I thought it was Unicode, but I'm not sure). If your PDF document does not use a standard character set (i.e. it has defined its own), it's impossible to know which characters correspond to which glyphs (symbols), and thus you are unable to convert it to a textual representation.
Read the PDF document, read the graphics objects, parse the instructions for rendering text (use the PDF specification for more insight into this process), and convert them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts used (their names and attributes) and the instructions (letter spacing, line spacing, size, face) in the graphics objects.
You can use the pdf-renderer library for reading and parsing the PDF files and then code an HTML translator yourself. This is not easy, and it does not cover all cases of PDF documents.
In this approach you will lose the original look of the document. There are some PDF generation libraries which do not use the Adobe font techniques. This is also a problem with the first approach: even though you can see the text, you cannot select it (but the official Adobe Reader behaves the same way, so you might say it's not a big deal).
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR): it's really overkill for this problem, and it has several drawbacks. This is the approach Google uses. If there are characters which are unrecognized, a human being does the processing.
If you are into the human-processing thing, you can use only the Type Select library and PDF-to-image conversion and do the OCR yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.