Auto-generate the Content in PDF format - Java


I'm developing a Java web app that calculates a user's IQ. At the end, I want to offer a "Get Your Certificate" option: a PDF file (a certificate of appreciation) should be generated automatically with the user's pre-entered name and IQ score.
How can I achieve this? I've already seen this kind of feature on some websites that provide certifications.

Java PDF APIs
Here is an answer to a similar question referencing a few well-known APIs.
Here is a more recent article detailing the licenses for those APIs.
Yet another listing of resources.
Flow of control
User clicks a link that generates a request that will be handled by the servlet.
Extract whatever you need from the URL within the servlet.
Use your chosen API to build the content for the PDF using a writer.
Push the PDF to the client.
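The servlet itself depends on the Servlet API, so the sketch below covers only the container-independent parts of the flow above: pulling the values out of the query string and choosing the response headers for the PDF. The class name CertificateRequest and the parameter names name and score are assumptions for illustration; they do not come from the question.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: the values a certificate servlet would read from the request. */
final class CertificateRequest {
    final String name;
    final int score;

    private CertificateRequest(String name, int score) {
        this.name = name;
        this.score = score;
    }

    /** Parses a raw query string such as "name=Ada+Lovelace&score=130". */
    static CertificateRequest fromQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip pairs without a key or without '='
            }
            String key = URLDecoder.decode(pair.substring(0, eq), StandardCharsets.UTF_8);
            String value = URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            params.put(key, value);
        }
        String name = params.get("name");   // assumed parameter name
        String score = params.get("score"); // assumed parameter name
        if (name == null || name.isBlank() || score == null) {
            throw new IllegalArgumentException("name and score are required");
        }
        // Integer.parseInt throws NumberFormatException for non-numeric scores.
        return new CertificateRequest(name.trim(), Integer.parseInt(score.trim()));
    }

    /** Headers to set on the response before writing the PDF bytes to it. */
    static Map<String, String> pdfHeaders(String fileName) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/pdf");
        headers.put("Content-Disposition", "attachment; filename=\"" + fileName + "\"");
        return headers;
    }
}
```

Inside a real servlet, HttpServletRequest.getParameter already does the query parsing, so only the validation and the headers would carry over. The PDF bytes produced by the chosen API then go to the response's output stream.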

Take a look at some iText samples. You can fill out a form, then click "flatten", and you have a PDF containing the data you entered. Since you're talking about a certificate, the easiest solution would be to create a PDF template using AcroForm technology. For instance, state.pdf is the interactive PDF that was used in the example I just mentioned.
The code used to fill out and flatten this form can be found here. For more examples, please read Chapter 6 of my book "iText in Action" (that chapter is available for free; you need section 6.3.5). I've also written a complete chapter about integrating code like this in a web application. You can find the examples that come with this chapter here.
Basically, you need to do something like this:
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
AcroFields fields = stamper.getAcroFields();
fields.setField("name", "CALIFORNIA");
fields.setField("abbr", "CA");
fields.setField("capital", "Sacramento");
fields.setField("city", "Los Angeles");
fields.setField("population", "36,961,664");
fields.setField("surface", "163,707");
fields.setField("timezone1", "PT (UTC-8)");
fields.setField("timezone2", "-");
fields.setField("dst", "YES");
stamper.setFormFlattening(true);
stamper.close();
reader.close();
Caveat regarding the data that is entered: The simple example uses a very basic font that doesn't know how to display special characters. If you need characters such as ñ, é, à, etc., you'll need to introduce a font with more glyphs.
Caveat regarding the jsp-tag you used: I have written this helloworld.jsp that results in this PDF, which proves that it is possible to generate PDF from JSP. Nevertheless, it is a bad idea to do so. When you learned how to write JSP, your teacher probably told you that JSP shouldn't be used to create binary files. (If he didn't tell you this, he either forgot or he wasn't a good teacher.) As there are so many pitfalls when using JSP to create binary files, and as a JSP file is eventually compiled to a Servlet anyway, you should forget about creating a JSP to create a PDF and write a Servlet instead. It will save you plenty of time and your code will be easier to maintain (the slightest change to your JSP file can break the code).

Related

Creating complex pdf using java

I have a Java/Java EE based application in which I need to create PDF certificates for various services provided to the users. I am looking for a way to create the PDFs (no need for digital certificates for now).
What is the easiest and most convenient way of doing that? I have tried:
XSL to PDF conversion
HTML to PDF conversion using iText
the crude Java way (using PdfWriter, PdfPCell, etc.)
Which of these is best, or is there another way that is easier and more convenient?

When you talk about Certificates, I think of standard sheets that look identical for every receiver of the certificate, except for:
the name of the receiver
the course that was followed by the receiver
a date
If this is the case, I would use any tool that allows you to create a fancy certificate (Acrobat, Open Office, Adobe InDesign,...) and create a static form (sometimes referred to as an AcroForm) containing three fields: name, course, date.
I would then use iText to fill in the fields like this:
PdfReader reader = new PdfReader(pathToCertificateTemplate);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(pathToCertificate));
AcroFields form = stamper.getAcroFields();
form.setField("name", name);
form.setField("course", course);
form.setField("date", date);
stamper.setFormFlattening(true);
stamper.close();
reader.close();
Creating such a certificate from code is "the hard way"; creating such a certificate from XML is "a pain" (because XML isn't well-suited for defining a layout), creating a certificate from (HTML + CSS) is possible with iText's XML Worker, but all of these solutions have the disadvantage that it's hard work to position every item correctly, to make sure everything fits on the same page, etc...
It's much easier to maintain a template with fixed fields. This way, you only have to code once. If for some reason you want to move the fields to another place, you only have to change the template, you don't have to worry about messing around in code, XML, HTML or CSS.
See http://www.manning.com/lowagie2/samplechapter6.pdf for some more info (section 6.3.5).

Try using Jasper Reports mate. Check it out at http://community.jaspersoft.com/

I recommend the first method, XSL to PDF conversion, which is the most powerful. I have produced a lot of PDF reports (each thousands of pages long) without trouble using Apache FOP. I think it's good enough and fairly easy, but it requires some knowledge of XSL-FO.

Even though this is an old question, I think it should be answered.
To create complex PDFs such as certificates, reports, or payment slips, you can use the DynamicReports library. It is built on JasperReports (itself a popular, long-established library). DynamicReports lets you design your documents in Java code, so you can easily make changes as required.
There are lots of examples on their site, and they are easy to learn from.
Below is the link:
http://www.dynamicreports.org/

Bruno Lowagie pointed out a great approach: create a template that is basically the same for all recipients and only needs to be populated with data. However, he recommends iText as the library to populate the fields. For me, as for Ankit, the iText license was an issue, so I had to choose another library. Below is a step-by-step guide to creating a template and populating it with data using Apache PDFBox.
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.16</version>
</dependency>
Create a template with LibreOffice Writer. For placeholders, use text boxes (View >> Toolbars >> Form Controls). This will create a PDF with AcroForms, as suggested by Bruno Lowagie.
Set a name for each text box. Set read-only to true.
Save the document as PDF.
Read the PDF template with PDFBox and set the values for the text boxes.
// try-with-resources closes the stream and the document even if an exception is thrown
try (InputStream is = getClass().getClassLoader().getResourceAsStream("Template.pdf");
     PDDocument pDDocument = PDDocument.load(is)) {
    PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
    PDField fieldName = pDAcroForm.getField("name");
    fieldName.setValue("FirstName Surname"); // <-- Replacement
    // outStream is the OutputStream to write to (for example a file or the HTTP response)
    pDDocument.save(outStream);
} catch (IOException e) {
    e.printStackTrace();
}

Use the iText PDF library for creating the PDFs. It will be easy for you to generate PDFs with that API. Here is the link:
http://itextpdf.com/
iText® is a library that allows you to create and manipulate PDF documents. It enables developers looking to enhance web and other applications with dynamic PDF document generation and/or manipulation.
Developers can use iText to:
Serve PDF to a browser
Generate dynamic documents from XML files or databases
Use PDF's many interactive features
Add bookmarks, page numbers, watermarks, etc.
Split, concatenate, and manipulate PDF pages
Automate filling out of PDF forms
Add digital signatures to a PDF file

You mentioned the PDFs can be complex. If this has to do with variability or layout, one option that provides reasonably sophisticated template-based layouts and controls is Docmosis. You provide Docmosis with doc or odt files as templates, so they are very easy to change, and then call Docmosis to mail-merge them into PDF or other formats. Please note that I work for the company that created Docmosis.
Hope that helps.

Creating a invisible PDF object with iText

I have a program that outputs to PDF, however, I want it to be able to read from it.
I have come up with my own data type which my program is able to read, but I need it somehow included in PDF file (no multiple files, I want one file per single output).
I also need this data to be invisible and undetectable for the user.
I heard something about PDF dictionaries, but I'm not sure how to do it (or if there's another way). I do not want to use XMP/XML file, my data is more complex than key-value.
What would be nice is somebody writing me a couple of example lines of code that would enable me to:
add a new dictionary to a PDF using iText
populate it with data using iText
locate it in a file using iText
read from it using iText

You want to do something similar to what Adobe Illustrator is doing. If you create a PDF from Adobe Illustrator, you can encapsulate the original AI file. This gives you the impression the PDF can be edited. In reality, Adobe Illustrator takes the AI file and uses that to edit, and re-creates the PDF from the updated AI.
Where is this information stored? See ISO-32000-1 section 14.5:
Conforming products may use this dictionary as a place to store
private data in connection with that document, page, or form. Such
private data can convey information meaningful to the conforming
product that produces it (such as information on object grouping for a
graphics editor or the layer information used by Adobe Photoshop®) but
may be ignored by general-purpose conforming readers.
I'm not sure what is being asked here. If you're asking for advice along the lines of what I answered above: you could, for instance, add a PieceInfo entry to the Root dictionary (aka the Catalog). This is all documented, isn't it? Read the ISO specification, and read part 4 of "iText in Action".
If your question is "write some code for me that does what I need", then I believe that's more or less in violation of the goal of this site.

Well, you could hex-encode your data as a String and then draw it off-page like this:
cb.showTextAligned(PdfContentByte.ALIGN_LEFT,"HIDDENDATA_"+ hexencodeddata, 2000f,2000f, 0f);
To read it back, go through all the strings in the PDF, searching for HIDDENDATA_.
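The encoding side of this idea needs no PDF library at all. Here is a minimal sketch, assuming the payload is an arbitrary byte array and that the text extracted from the PDF is already available as a String; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for the HIDDENDATA_ marker approach. */
final class HiddenData {
    static final String MARKER = "HIDDENDATA_";

    /** Hex-encodes the payload and prefixes it with the marker. */
    static String encode(byte[] payload) {
        StringBuilder sb = new StringBuilder(MARKER);
        for (byte b : payload) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }

    /** Finds the marker in extracted text and decodes the hex run after it; null if absent. */
    static byte[] decode(String extractedText) {
        int start = extractedText.indexOf(MARKER);
        if (start < 0) {
            return null;
        }
        start += MARKER.length();
        int end = start;
        // The payload ends at the first character that is not a hex digit.
        while (end < extractedText.length()
                && Character.digit(extractedText.charAt(end), 16) >= 0) {
            end++;
        }
        byte[] out = new byte[(end - start) / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(extractedText.charAt(start + 2 * i), 16);
            int lo = Character.digit(extractedText.charAt(start + 2 * i + 1), 16);
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    /** Convenience overload for text payloads. */
    static String encode(String payload) {
        return encode(payload.getBytes(StandardCharsets.UTF_8));
    }
}
```

The string returned by encode is what would be passed to showTextAligned. Because decode stops at the first non-hex character, the drawn string should be followed by whitespace or the end of the text so that it is not merged with neighbouring hex-like characters.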
Another way is to use Annotations
public void addAnnotation(PdfWriter writer,
        Document document, Rectangle rect, String text) {
    PdfAnnotation annotation = new PdfAnnotation(writer,
            new Rectangle(
                    rect.getRight() + 10, rect.getBottom(),
                    rect.getRight() + 30, rect.getTop()));
    annotation.setTitle("Text annotation");
    annotation.put(PdfName.SUBTYPE, PdfName.TEXT);
    annotation.put(PdfName.OPEN, PdfBoolean.PDFFALSE);
    annotation.put(PdfName.NAME, new PdfName(text));
    writer.addAnnotation(annotation);
}
And then use something like this to read it:
http://downloads.snowtide.com/javadoc/PDFTextStream/2.3.2/com/snowtide/pdf/PDFTextStream.html

PDF Handling in Java

I have created a program that should one day become a PDF editor.
Its purpose will be saving the GUI's textual content to a PDF and loading it back from it. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page, it can have many more; the upper and lower margins are also cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF creation needs. However, if I understood correctly, it retrieves the text of a whole page as a single string, which won't work for me, because I will need to detect different fields/paragraphs/something to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend one that fits my needs.
Or maybe to recommend how to add invisible parts to a PDF that will help my program determine where exactly each piece of text is situated inside the PDF file...
Just to be clear (I phrased my question wrongly before), the only thing I need to put in my PDF is text, and that's all I need to get out of it later. My program should be able to read the PDFs that it created itself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.

Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner that they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.

If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
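As a rough illustration of the XML side of this suggestion, the sketch below writes the text of each field to an XML string and reads it back, using only the JDK's built-in XML APIs. The element names fields and field and the class name FieldStore are assumptions; attaching the resulting XML to the PDF depends on the PDF library and is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: stores the editor's field texts as XML and restores them. */
final class FieldStore {

    /** Serializes the field texts, in order, as <fields><field>...</field></fields>. */
    static String toXml(List<String> fieldTexts) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("fields");
            doc.appendChild(root);
            for (String text : fieldTexts) {
                Element field = doc.createElement("field");
                field.setTextContent(text); // special characters are escaped on output
                root.appendChild(field);
            }
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not write field XML", e);
        }
    }

    /** Reads the field texts back in document order. */
    static List<String> fromXml(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("field");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("could not read field XML", e);
        }
    }
}
```

Because the field order is preserved, the program can map the restored strings back onto its JTextAreas by position; a name or index attribute per field would be a natural extension if the layout varies.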

Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

How can I extract only the main textual content from an HTML page?

Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content: many pages don't have an article, only links with short descriptions of the full texts (this is common on news portals), and I don't want to discard these short texts.
So if an API can do this, that is, return the different textual parts as separate blocks rather than as one single text (everything in one text is not useful), please let me know.
The Question
I download pages from random sites, and now I want to analyze the textual content of each page.
The problem is that a web page has a lot of content such as menus, advertising, banners, etc.
I want to exclude everything that is not related to the content of the page.
Taking this page as an example, I want neither the menus above nor the links in the footer.
Important: All pages are HTML and come from various different sites. I need suggestions on how to exclude this content.
At the moment, I'm thinking of excluding content inside "menu" and "banner" classes in the HTML, as well as consecutive words that look like proper names (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not in an external application (if possible).
I tried an approach of parsing the HTML content, described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering

Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.

You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!

There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, namely those that have a single body of content.
So one can crudely classify web pages into three kinds with respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that doesn't really have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well: it might require an aggregate of related web pages to determine what is clutter and what isn't.

You can use a library like goose. It works best on articles/news.
You can also check the JavaScript code of the readability bookmarklet, which does a similar extraction to goose.

My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (e.g. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (e.g. do not retrieve div elements)?

http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design XML files that the RoboServer API reads in order to parse web pages. You build the XML files by analyzing the pages you wish to parse inside the provided application (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and DB integration using the provided Java API.
If you're against using proprietary software and would rather do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build per-site

You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit

You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML. I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/

You could use the textracto api, it extracts the main 'article' text and there is also the opportunity to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.

How to generate a printable output for a phonebook

I'm developing desktop software to manage people and telephone numbers, and also to generate (export) a list of telephone numbers (with a summary of the cities) that can be printed (as a PDF, for example). The telephone management part is ready and was written in Java with SWT/JFace. Exporting the list in a print-friendly format is what has become an issue.
I tried exporting the list as HTML with CSS, but the result is not the same in different browsers.
I was thinking about generating it in LaTeX, but creating a style is getting too complicated (I need an A7 page size, smaller fonts...).
What file format can be used to export this list? Is there an easy way to generate printable output?
Edit: I forgot to mention that the file will be sent to a company to be printed.
Thanks!

Generate a PDF; it will look the same no matter what viewer they use. You can use iText to create the PDF. It is fairly straightforward for a simple PDF.

You could just draw an image. It will stay the same on different systems and it's easy to print. By drawing it, you can style it however you imagine, without learning any document format. It should be easy to draw a simple table.

Plain text is a very friendly format for me. Although this could also be done with HTML and CSS, if you keep the style complexity to a minimum. Try reading:
http://www.smashingmagazine.com/2010/06/07/the-principles-of-cross-browser-css-coding/
And be careful when choosing your properties!
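For the plain-text route, a minimal sketch of a fixed-width listing with a per-city summary could look like this. The Entry fields and the column widths are assumptions, since the question does not describe the data model.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plain-text exporter for the phonebook. */
final class PhonebookText {

    /** One phonebook row; the fields are assumed for illustration. */
    static final class Entry {
        final String name;
        final String city;
        final String phone;

        Entry(String name, String city, String phone) {
            this.name = name;
            this.city = city;
            this.phone = phone;
        }
    }

    /** Renders one left-aligned row per entry, a blank line, then entry counts per city. */
    static String render(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        Map<String, Integer> perCity = new TreeMap<>(); // sorted by city name
        for (Entry e : entries) {
            sb.append(String.format("%-20s %-12s %s\n", e.name, e.city, e.phone));
            perCity.merge(e.city, 1, Integer::sum);
        }
        sb.append("\n");
        for (Map.Entry<String, Integer> c : perCity.entrySet()) {
            sb.append(String.format("%-12s %d\n", c.getKey(), c.getValue()));
        }
        return sb.toString();
    }
}
```

The columns only line up when the text is printed in a monospaced font, and names longer than the column width push the rest of the row to the right, so the widths would need tuning for real data.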
