Adding text to a pdf? - java

I have a template PDF for an agenda, and I want to know how to detect where the date should go.
Let's say the template contains the word “DATE:”.
I then want to add the corresponding date/text next to it: I detect “DATE:”, write the date so that the result looks something like “DATE: 13/02/2020”, and save it as a new PDF.

You tagged your question with both java and python-3.x, which makes it very broad. My answer is therefore generic rather than language-specific. In general, you should decide which language you are asking about.
For your task you will need to do two things:
first, apply text extraction with coordinates to your PDF, search for the DATE marker in the text, and determine the coordinates right after that piece of text (some libraries offer a shortcut: routines that extract only the text matching a regular expression, together with its coordinates);
then, add text to the page content at those coordinates.
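The first of those two steps can be sketched independently of any particular PDF library. In the sketch below, `TextChunk`, `InsertionPoint`, and `MarkerLocator` are hypothetical names that do not belong to any real library; `TextChunk` merely stands in for whatever positioned text pieces your chosen library's extraction returns. It also assumes the marker sits inside a single chunk and that text runs left to right.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for one positioned text piece returned by a PDF library's extraction. */
final class TextChunk {
    final String text;
    final float x;      // left edge of the chunk
    final float y;      // baseline of the chunk
    final float width;  // rendered width of the chunk

    TextChunk(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

/** Where the new text should start: right of the marker, on the same baseline. */
final class InsertionPoint {
    final float x;
    final float y;

    InsertionPoint(float x, float y) {
        this.x = x;
        this.y = y;
    }
}

final class MarkerLocator {
    /** Scans the chunks for the marker and returns the point just after it, if found. */
    static Optional<InsertionPoint> findAfter(List<TextChunk> chunks, String marker, float gap) {
        if (marker.isEmpty()) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        for (TextChunk chunk : chunks) {
            int idx = chunk.text.indexOf(marker);
            if (idx >= 0) {
                // Approximation: assume the chunk's width is spread evenly over its characters.
                float endIndex = idx + marker.length();
                float rightEdge = chunk.x + chunk.width * endIndex / chunk.text.length();
                return Optional.of(new InsertionPoint(rightEdge + gap, chunk.y));
            }
        }
        return Optional.empty();
    }
}
```

The second step, drawing the date at the returned point, depends entirely on the library you pick, so it is left out here. The even-width approximation is only exact when the chunk is the marker itself; a real library usually exposes per-character positions that would be more precise.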
Neither Java nor Python has explicit PDF support in its core, so for your task you'll have to choose a PDF library. (Theoretically you could try to implement your own PDF processing routines, but the PDF format is quite complex, so in general that would take very long.)
So you should first check which general-purpose PDF library for your chosen language appears most appropriate for those tasks and your other requirements (like licensing). There are many questions and answers on Stack Overflow concerning text extraction which may help you choose.
A word of warning, though: not all PDFs allow proper text extraction. There are PDF generators which don't add the information required for text extraction to the PDFs they produce; some even add misleading information. Thus, you might have to reject some templates. Alternatively, if the template is fixed, simply determine the correct coordinates for text insertion by measuring in a PDF viewer or by trial and error.
And if you still have influence on the requirements, propose using templates with PDF AcroForm form fields. Form field fill-in gives the template designer more control over the positioning and styling of the fill-ins, and fill-in is easier than the process outlined above. If you don't want form fields in the resulting PDFs, simply flatten the forms after fill-in.

Related

Alternative to Markdown with Color support

I am writing a note app (Android and a REST API built with PHP/Slim 3). I am wondering whether there is something other than Markdown for saving notes in a readable and interchangeable format. The problem with Markdown for me is that there is no way to style text (e.g. colored text). It is also hard to extend Markdown with custom attributes.
I am already thinking of creating my own data format (or using XML), but this means a lot of work for parsing it. I like the idea of using a standard format to interchange notes between client and server and with other applications. But the feature set of Markdown is very limited (by design, for sure).
Do you have any tips on this topic?
This question verges on overly-broad, i.e. it may lead to an argument over technologies rather than a "this is the solution" situation.
That being said, here's an answer I think won't be controversial: when you say
"readable, interchangeable format... solution to style texts... custom attributes"
I think HTML. I don't recommend trying to roll your own format, because 1.) you are correct that it will be difficult and 2.) it will be even more difficult to match the feature sets of existing solutions.
To sum it up: I like the idea of using HTML instead of Markdown. It is an open standard format, and it is exchangeable as well as human-readable.
The problem I see with all of these solutions: How to write a WYSIWYG-Editor with this in mind? I am already working with Markdown using the Markwon library: https://github.com/noties/Markwon
It is no problem to write Markdown in an Android EditText widget and render it. You can easily convert it back to plain text (so you can save it). It is much more complicated to get a WYSIWYG experience. You have to handle every user input, writing to a second file or string which contains the markup while the user just sees the rendered result. The user can edit or delete anything anywhere in the EditText, and you have to make sure that those changes affect the Markdown string/file too. I didn't find an easy solution for this.
The easiest way would be to somehow parse the content of the EditText back to Markdown. But there is no getSpannables method or the like for the EditText widget. I am thinking of looping through the EditText to see what character is there and how it's formatted. But I think this will have disadvantages too, because there are other things like bulleted lists and checkboxes.

Creating an invisible PDF object with iText

I have a program that outputs to PDF; however, I also want it to be able to read that PDF back in.
I have come up with my own data type which my program is able to read, but I need it somehow included in the PDF file (no multiple files; I want one file per output).
I also need this data to be invisible and undetectable for the user.
I heard something about PDF dictionaries, but I'm not sure how to use them (or whether there's another way). I do not want to use an XMP/XML file; my data is more complex than key-value.
What would be nice is somebody writing me a couple of example lines of code that would enable me to:
add a new dictionary to a PDF using iText
populate it with data using iText
locate it in a file using iText
read from it using iText
You want to do something similar to what Adobe Illustrator is doing. If you create a PDF from Adobe Illustrator, you can encapsulate the original AI file. This gives you the impression the PDF can be edited. In reality, Adobe Illustrator takes the AI file and uses that to edit, and re-creates the PDF from the updated AI.
Where is this information stored? See ISO-32000-1 section 14.5:
"Conforming products may use this dictionary as a place to store private data in connection with that document, page, or form. Such private data can convey information meaningful to the conforming product that produces it (such as information on object grouping for a graphics editor or the layer information used by Adobe Photoshop®) but may be ignored by general-purpose conforming readers."
I'm not sure what is being asked here. If you're asking for advice, it's what I answered above: for instance, add a PieceInfo entry to the Root dictionary (aka the Catalog). This is all documented, isn't it? Read the ISO specification, and read part 4 of "iText in Action".
If your question is "write some code for me that does what I need", then I believe that's more or less in violation of the goal of this site.
Well you could hex encode your data as a String and then draw it off screen like this:
cb.showTextAligned(PdfContentByte.ALIGN_LEFT,"HIDDENDATA_"+ hexencodeddata, 2000f,2000f, 0f);
To read it back, process all strings in the extracted text, searching for the HIDDENDATA_ prefix.
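The hex encoding and the prefix search are plain string handling and don't depend on iText at all. Here is a minimal sketch of that part only; the class name `HiddenPayload` is made up for illustration, and it assumes that whatever text extraction you use hands the hidden string back unbroken and followed by a non-hex character (or the end of the text).

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class HiddenPayload {
    static final String PREFIX = "HIDDENDATA_";
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Hex-encodes the payload's UTF-8 bytes and prepends the marker prefix. */
    static String encode(String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(PREFIX.length() + bytes.length * 2).append(PREFIX);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }

    /** Finds the first prefixed hex run in the extracted text and decodes it. */
    static Optional<String> decode(String extractedText) {
        int start = extractedText.indexOf(PREFIX);
        if (start < 0) {
            return Optional.empty();
        }
        int pos = start + PREFIX.length();
        int end = pos;
        // Consume hex digits until the first character that is not one.
        while (end < extractedText.length()
                && Character.digit(extractedText.charAt(end), 16) >= 0) {
            end++;
        }
        int byteCount = (end - pos) / 2; // a dangling half byte is ignored
        byte[] bytes = new byte[byteCount];
        for (int i = 0; i < byteCount; i++) {
            int hi = Character.digit(extractedText.charAt(pos + 2 * i), 16);
            int lo = Character.digit(extractedText.charAt(pos + 2 * i + 1), 16);
            bytes[i] = (byte) ((hi << 4) | lo);
        }
        return Optional.of(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The string returned by `encode` is what you would pass to the drawing call above. Keep in mind that text drawn off the page is still in the content stream, so it is hidden from casual viewing but not from anyone who extracts the text.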
Another way is to use annotations:
public void addAnnotation(PdfWriter writer,
        Document document, Rectangle rect, String text) {
    PdfAnnotation annotation = new PdfAnnotation(writer,
        new Rectangle(
            rect.getRight() + 10, rect.getBottom(),
            rect.getRight() + 30, rect.getTop()));
    annotation.setTitle("Text annotation");
    annotation.put(PdfName.SUBTYPE, PdfName.TEXT);
    annotation.put(PdfName.OPEN, PdfBoolean.PDFFALSE);
    annotation.put(PdfName.NAME, new PdfName(text));
    writer.addAnnotation(annotation);
}
And then use something like this to read it:
http://downloads.snowtide.com/javadoc/PDFTextStream/2.3.2/com/snowtide/pdf/PDFTextStream.html

PDF Handling in Java

I have created a program that should one day become a PDF editor.
Its purpose will be saving the GUI's textual content to a PDF and loading it back from it. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page; it can have many more, and the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF creation needs. However, if I understood correctly, it retrieves the text of a whole page as one string, which won't work for me, because I will need to detect different fields/paragraphs/or something similar to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend one that fits my needs.
Or maybe recommend how to add invisible parts to a PDF which will help my program determine where exactly each piece of text is situated inside the PDF file...
Just to be clear (I phrased my question badly before), the only thing I need to put in my PDF is text, and that's all I need to be able to get out later. My program should be able to read PDFs which it created itself...
Also, because of the designated use of the files created with this program, they need to be in PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
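As a rough illustration of the "store the data as XML and read it back" idea, here is a sketch using only the JDK's built-in XML classes. The class name `FieldStore` and the `fields`/`field`/`name` element and attribute names are my own invention, and the sketch assumes each GUI field can be represented as a name plus its text. Attaching the resulting bytes to the PDF is library-specific and not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class FieldStore {
    /** Serialises field name -> text into a small XML document. */
    static byte[] toXml(Map<String, String> fields) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("fields");
            doc.appendChild(root);
            for (Map.Entry<String, String> entry : fields.entrySet()) {
                Element field = doc.createElement("field");
                field.setAttribute("name", entry.getKey());
                field.setTextContent(entry.getValue()); // special characters are escaped on output
                root.appendChild(field);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("could not write field XML", e);
        }
    }

    /** Reads the fields back, preserving their original order. */
    static Map<String, String> fromXml(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            NodeList nodes = doc.getElementsByTagName("field");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element field = (Element) nodes.item(i);
                fields.put(field.getAttribute("name"), field.getTextContent());
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not read field XML", e);
        }
    }
}
```

Since the program only reads files it wrote itself, the parser is left at its defaults here; if the XML could ever come from elsewhere, you would want to harden the parser configuration first.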
Assuming that your application will only consume PDF files generated by the same application, there is a part of the PDF specification called Marked Content that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc.).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

Rendering PDF on a website à la Google Documents

In a current project I need to display PDFs in a web page. Right now we are embedding them with the Adobe PDF Reader, but I would rather have something more elegant (the reader does not integrate well, it cannot be overlaid with transparent regions, ...).
I envision something close to Google Documents, where they display PDFs as images but also allow text to be selected and copied out of the PDF (a requirement we have).
Does anybody know how they do this? Or know of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on the server side, but this would not allow for the selection of text ...
Thanks in advance for any help.
PS: Java-based project, using Wicket.
I have some suggestions, but this stuff will definitely be hard to implement. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into images. Store these images on your server or use a caching technique. Converting a PDF into an image is not hard.
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay textual data on the image. This overlaid text is selectable, while the visible text is still part of the original image. To get the original text, see the next approach, or do it yourself (see the conclusion).
The original text must then be positioned over the image, which is a pain.
Second approach:
The PDF specification allows textual information to be linked to a font. Most documents use a subset of Type 3 or Type 1 fonts which (often) use a standard character set (I thought it was Unicode, but I'm not sure). If your PDF document does not use a standard character set (i.e. it defines its own), it's impossible to know which characters correspond to which glyphs (symbols), and thus you are unable to convert it to a textual representation.
Read the PDF document, read the graphics objects, and parse the instructions for rendering text (use the PDF specification for more insight into this process), converting them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts used (their names and attributes) and the instructions (letter spacing, line spacing, size, face) in the graphics objects.
You can use the pdf-renderer library for reading and parsing the PDF files and then code an HTML translator yourself. This is not easy, and it does not cover all kinds of PDF documents.
In this approach you will lose the original look of the document. There are also some PDF generation libraries which do not use the Adobe font techniques. That is a problem with the first approach too: even though you can see the text, you cannot select it (but the official Adobe Reader behaves the same way, so you might say it's not a big deal).
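To make the tag selection a bit more concrete, here is a small sketch of such a mapping. `TextRun` and `HtmlMapper` are hypothetical names: `TextRun` stands for whatever your PDF parser yields for a piece of text together with its font name and size. The heuristics (looking for "bold"/"italic" in the font name, treating text at 1.5 times the body size as a heading) are assumptions for illustration, not rules from the PDF specification.

```java
import java.util.Locale;

/** Hypothetical result of parsing one text-showing instruction from a PDF page. */
final class TextRun {
    final String text;
    final String fontName;
    final float fontSize;

    TextRun(String text, String fontName, float fontSize) {
        this.text = text;
        this.fontName = fontName;
        this.fontSize = fontSize;
    }
}

final class HtmlMapper {
    /** Picks inline and block tags for a run based on its font name and size. */
    static String toHtml(TextRun run, float bodySize) {
        String name = run.fontName.toLowerCase(Locale.ROOT);
        String html = escape(run.text);
        // Inline styling guessed from the font name, e.g. "Helvetica-Bold".
        if (name.contains("bold")) {
            html = "<b>" + html + "</b>";
        }
        if (name.contains("italic") || name.contains("oblique")) {
            html = "<i>" + html + "</i>";
        }
        // Block tag guessed from the size relative to the body text size.
        String block = run.fontSize >= bodySize * 1.5f ? "h1" : "p";
        return "<" + block + ">" + html + "</" + block + ">";
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

A real translator would also have to merge adjacent runs into paragraphs and deal with fonts whose names say nothing about their style, which is where most of the effort goes.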
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR), since it's really overkill for such a problem and has several drawbacks. This is the approach Google is using. If there are characters which are unrecognized, a human being does the processing.
If you are into the human-processing thing, you can use only the Type Select library and PDF-to-image conversion and do the OCR yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.

Long text input from user and PDF generation

I have built a web application that can be seen as an overcomplicated application form. There are a bunch of text areas, each with a given character limit. After the form is submitted, various things happen, and one of them is PDF generation.
The text is queried from the DB and inserted into a PDF template created in iReports. This works fine, but the major pain is overflowing text.
The maximum number of characters is set based on 'average' text. But sometimes people prefer to write in CAPS or add plenty of linefeeds to format their text. These then cause the user's text to overflow the space given in the PDF. Unfortunately, the PDF document must look like a real application form, so I cannot allow unlimited space.
What kinds of approaches have you used to tackle this?
Clean/restrict user input?
Calculate the space requirement of the text based on font metrics?
Provide preview of the PDF? (too bad users are not allowed to change their input after submission...)
Ideally, calculate the space requirement based on font metrics. I don't know how iReports handles text, but iText lays everything out itself; you just present the data as a streaming document, so we don't worry about overflowing text.
However, iReport may not support that, or you may need the PDF layout to fit within certain bounds. I'd try to clean the input (i.e. if it's all caps, convert it to lowercase/sentence case/proper case) and strip extra whitespace. If cleaning the input can't be done reliably, or people are still getting past that, I'd also restrict it.
As a last resort, I'd present the PDF for the user to authorize. Really, users shouldn't be given more work to do, and they're not going to do it anyway.
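A cleaning step along those lines could look roughly like this. `InputCleaner` is just an illustrative name, and the rules (collapse spaces and tabs, allow at most one blank line in a row, sentence-case text that contains no lowercase letters at all) are my assumptions about what "clean" should mean; adjust them to your form.

```java
import java.util.Locale;

final class InputCleaner {
    /** Normalises whitespace and tones down text that was typed entirely in capitals. */
    static String clean(String raw) {
        String text = raw.replace("\r\n", "\n").replace('\r', '\n');
        text = text.replaceAll("[ \\t]+", " ");    // collapse runs of spaces and tabs
        text = text.replaceAll(" ?\\n ?", "\n");   // drop spaces around line breaks
        text = text.replaceAll("\\n{3,}", "\n\n"); // allow at most one blank line in a row
        text = text.trim();
        return isAllCaps(text) ? toSentenceCase(text) : text;
    }

    /** True if the text has at least one uppercase letter and no lowercase letters. */
    static boolean isAllCaps(String s) {
        boolean sawUpper = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLowerCase(c)) {
                return false;
            }
            if (Character.isUpperCase(c)) {
                sawUpper = true;
            }
        }
        return sawUpper;
    }

    /** Lowercases everything, then capitalises the first letter of each sentence. */
    static String toSentenceCase(String s) {
        String lower = s.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        boolean capitalizeNext = true;
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (capitalizeNext && Character.isLetter(c)) {
                sb.append(Character.toUpperCase(c));
                capitalizeNext = false;
            } else {
                sb.append(c);
            }
            if (c == '.' || c == '!' || c == '?') {
                capitalizeNext = true;
            }
        }
        return sb.toString();
    }
}
```

Note that sentence-casing loses acronyms and proper nouns, which is one reason cleaning alone can't be done fully reliably.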
Your own suggested solutions to your problem are all good. Probably the most important question to answer is: what should your PDF look like when the data to be displayed in a field won't fit? Do you ever need the "full answer" for anything else? When you know the answers to these, your options will be reduced.
For example, if a field must be limited to half a page, and users sometimes enter more than half a page of text, you can either
1) limit the user input: on submission, calculate the size (using font metrics, as you said) and reject the submission until it is corrected. This assumes you can legitimately force the user to reduce their data entry.
2) accept the user input and truncate it in the display of this report. Some systems use "..." to indicate that data has been truncated, and can provide a hyperlink (even within the PDF) to get more information.
Providing a preview would work really well, but only if the users are good at checking and correcting and your system can handle the extra load this will generate.
Do you have control of the font that is used when generating the PDF? If so, I would look for a font in the monospace family. This will give you a consistent length for a given number of characters, regardless of punctuation, capitalization, etc.
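Both options rest on the same calculation: wrap the text to the field width and count the lines. Below is a sketch of that calculation. `FieldFitter` is a hypothetical name, and the `widthOf` function is an assumption standing in for the font metrics of whatever library renders your PDF; with a monospace font it is simply the character count times the glyph width.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class FieldFitter {
    /** Greedy word wrap that honours explicit line breaks; an overlong word gets its own line. */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : text.split("\n", -1)) {
            StringBuilder line = new StringBuilder();
            for (String word : paragraph.trim().split("\\s+")) {
                String candidate = line.length() == 0 ? word : line + " " + word;
                if (line.length() > 0 && widthOf.applyAsDouble(candidate) > maxWidth) {
                    lines.add(line.toString());
                    line = new StringBuilder(word);
                } else {
                    line = new StringBuilder(candidate);
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    /** Option 1: check on submission whether the text fits the field. */
    static boolean fits(String text, double maxWidth, int maxLines,
                        ToDoubleFunction<String> widthOf) {
        return wrap(text, maxWidth, widthOf).size() <= maxLines;
    }

    /** Option 2: keep the first maxLines lines (maxLines >= 1) and mark the cut with "...". */
    static List<String> truncate(String text, double maxWidth, int maxLines,
                                 ToDoubleFunction<String> widthOf) {
        List<String> lines = wrap(text, maxWidth, widthOf);
        if (lines.size() <= maxLines) {
            return lines;
        }
        List<String> kept = new ArrayList<>(lines.subList(0, maxLines));
        String last = kept.get(maxLines - 1);
        // Shorten the last line until the ellipsis fits inside the field width.
        while (!last.isEmpty() && widthOf.applyAsDouble(last + "...") > maxWidth) {
            last = last.substring(0, last.length() - 1);
        }
        kept.set(maxLines - 1, last + "...");
        return kept;
    }
}
```

For example, with a monospace width function such as `s -> s.length() * 6.0` and a 60-unit-wide field, each line holds at most ten characters. The result is only as accurate as `widthOf`, so it should use the same font and size as the template.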
