Print XML as PDF using PCL - java

I have an XML file that I want to print as a PDF using PCL. I am new to PCL. Can I use PCL to get the XML printed in PDF format directly, or do I need an intermediate step that creates a PDF file, which I then print using PCL?

If you have an XML file, there are two ways to get a PDF file:
1. Create a stylesheet for your XML and use XEP,
or
2. Use just your XML and VisualXSL, which will help you create your PDF for print.
In addition: if you create your own XSL stylesheet, XEP can produce many types of PDF, for example PDF/1A, or other levels.
Both XEP and VisualXSL are RenderX products (http://www.renderx.com/tools/index.html) and they have trial versions that you can use :). I have used both products many times and was satisfied.
You can also visit the forum, where you can find answers about how to use the products described above and how useful they are: http://cooltools.renderx.com

PCL is a printer control language, in other words command bytes you send to a (usually HP) printer, which the printer then converts to ink on a page. This is normally not the way to generate a PDF, since too much information from the original would be lost.
You will normally want to convert your XML into something that describes the actual printed output you want. A reasonable choice for this is the XSL-FO XML dialect, which, however, is not very pleasant to write by hand. You can instead convert your XML into DocBook XML, which in turn has very good stylesheets for converting further to XSL-FO and other formats.
You can then use Apache FOP to convert XSL-FO into many formats, one being PDF. This also allows you, if you outgrow FOP, to replace it with one of several commercial XSL-FO rendering engines at a later date.
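As a rough illustration of the XML to XSL-FO step, here is a minimal sketch that uses only the JDK's built-in XSLT support (javax.xml.transform). The <items>/<item> input format, the stylesheet and the class name XmlToFo are made up for the example; the resulting XSL-FO string is what you would then hand to FOP (or another XSL-FO engine), which is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XmlToFo {
    // Hypothetical stylesheet: turns each <item> of an assumed <items> document
    // into one fo:block on a single A4 page master.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
      + "<xsl:template match='/items'>"
      + "<fo:root>"
      + "<fo:layout-master-set>"
      + "<fo:simple-page-master master-name='A4'"
      + " page-height='29.7cm' page-width='21cm' margin='2cm'>"
      + "<fo:region-body/>"
      + "</fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='A4'>"
      + "<fo:flow flow-name='xsl-region-body'>"
      + "<xsl:for-each select='item'>"
      + "<fo:block><xsl:value-of select='.'/></fo:block>"
      + "</xsl:for-each>"
      + "</fo:flow>"
      + "</fo:page-sequence>"
      + "</fo:root>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML and returns the XSL-FO document as a string.
    static String toFo(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }
}
```

Calling XmlToFo.toFo("<items><item>first</item></items>") yields an fo:root document with one fo:block per item. In a real project the stylesheet would live in its own .xsl file rather than in a string constant.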

Related

Why does extracting tables from a converted docx work better than from the original PDF?

I'm trying to perform automatic table extraction from PDFs. I know there are several libraries and methods for Java and Python, but to my surprise, the method that has worked best for me is to convert my PDF to a docx document and extract the tables from there (thanks to: How to get pictures and tables from .docx document using apache poi?).
My question is this: assuming that the format conversion may lose information, why are my results better this way? Tabula hasn't been able to do better automatically. To understand this, I have looked for information (e.g. Extracting table contents from a collection of PDF files), but I'm still very confused.
PS: So far I have used https://github.com/thoqbk/traprange (a method based on PDFBox), How to extract table as text from the PDF using Python? (PyPDF2) and Tabula. When I get home I will post code and cases; I'm writing from my smartphone.

Java: Converting a PostScript File into text

Is there a Java library that converts a PostScript file (".ps") into a String or text file (or something I can read with an InputStream)?
I have these files and need to read them and handle them according to the text in them. They always contain only text, and usually it's just one line like
date:SWYgeW91IHJlYWQgdGhpcyB5b3UncmUgcHJvYmFibGUgdG8gY3VyaW91cyAgYnV0IG5pY2UgdHJ5IGFueXdheS4gUGxlYXNlIEhlbHA=
in it.
Right now I convert it into a PDF and "read" it with an OCR engine. But that seems a little over the top for just one line.
Is there another way to do it?
If you could point me in the right direction, that would be great.
PostScript is a language for defining graphical output on paper, aimed at a printer device. As such it does not really contain plain text, and "extracting" text from it poses problems. The text could, for instance, be determined programmatically in places, or it could be interspersed with PS code, making the text data useless.
Normally you would send a modified PS file to a printer (real or virtual) with a specific configuration that causes the result to be output as a plain text sequence (without the graphical formatting).
This is often done by altering the PS code file so that the text output command behaves differently.
A description of this method can be found in part 3 of the following University of Waikato paper:
http://www.cs.waikato.ac.nz/~ihw/papers/98NM-Reed-IHW-Extract-Text.pdf
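For very simple files like the one-liner in the question, a much cruder approach than the method in that paper may be enough. The sketch below does not redefine the text output command; it just scans the PS source with a regular expression for literal strings that are passed directly to the show operator. This is an assumption about how your files are written: text produced by custom procedures, hex strings, re-encoded fonts, or strings with nested unescaped parentheses will be missed. The class name PsShowText is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PsShowText {
    // Matches a literal string operand directly followed by "show", e.g. "(Hello) show".
    // Escaped characters are allowed inside the string; nested unescaped
    // parentheses are not (a deliberate simplification).
    private static final Pattern SHOW =
        Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)\\s*show\\b");

    // Returns the string operands of all plain "(...) show" calls, in file order.
    static List<String> extract(String postScript) {
        List<String> texts = new ArrayList<>();
        Matcher m = SHOW.matcher(postScript);
        while (m.find()) {
            // Undo the simple escapes \( \) and \\ ; other escapes are left untouched.
            texts.add(m.group(1).replaceAll("\\\\([()\\\\])", "$1"));
        }
        return texts;
    }
}
```

You would read the .ps file into a String (for example with java.nio.file.Files) and pass it to PsShowText.extract. If the list comes back empty, the file does not use plain show calls and you are back to one of the more robust routes described in the answers here.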
If you convert the PostScript file to PDF (for example, with Ghostscript ps2pdf or with Acrobat Distiller), you could then read this file using iText (http://itextpdf.com). You could also convert the PDF into a more readable form using RUPS, one of the iText tools.

How to test the PDF generated using iText?

I have used iText in Java to convert HTML to PDF.
Now I want to test whether the PDF I generated is correct, i.e. that all the contents are correct and at the correct positions.
Is there a way to test my code?
Basically, your question is about validating iText output.
If you do not trust the library for converting HTML to PDF, you probably do not trust it for reading raw PDF data either. You can therefore use another library (PDF Clown) to parse the PDF as validation.
You have two approaches.
The first requires rasterizing the PDF (Ghostscript) and comparing the result to the HTML. The performance overhead is significant.
The second parses the document format. I went into depth on this in my previous answer about searching for text inside a PDF file.
There I covered searching for text as well as finding its position on the page.
I would suggest simply not validating the output unless you know something is wrong.
These libraries are widely used and well tested.

PDF Handling in Java

I have created a program that should one day become a PDF editor.
Its purpose will be to save the GUI's textual content to a PDF and load it back from it. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page, it can have many more; also, the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF-creating needs. However, if I understood correctly, it retrieves the text of a whole page as one string, which won't work for me, because I will need to detect the different fields/paragraphs/or something to be able to load them back into the program.
Now, I'm a bit lazy, and I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend one that fits my needs.
Or maybe to recommend how to add invisible parts to the PDF that will help my program determine where exactly each piece is situated inside the PDF file...
Just to be clear (I phrased my question wrongly before), the only thing I need to put in my PDF is text, and that's all I need to be able to get out later. My program should be able to read PDFs that it created itself...
Also, because of the intended use of the files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner that they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage is that you are then not tied to a specific output format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image), because you now have a common representation.
I found this link and another link that provide examples showing how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
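To make the intermediate-format idea concrete, here is a minimal sketch using only the JDK: the editor's fields are kept in a java.util.Properties object and written to and read back from XML with storeToXML / loadFromXML. The class name FieldStore and the field names are made up for the example, and attaching the resulting XML to the PDF is left out, since that part depends on the PDF library you choose.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Properties;

final class FieldStore {
    // Serializes the field name -> text mapping to an XML document (UTF-8 bytes).
    static byte[] toXml(Properties fields) {
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            fields.storeToXML(out, "editor fields");
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the mapping back from XML produced by toXml.
    static Properties fromXml(byte[] xml) {
        Properties fields = new Properties();
        try (ByteArrayInputStream in = new ByteArrayInputStream(xml)) {
            fields.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }
}
```

On save you would fill a Properties object from the JTextAreas (for example key "title" with the text of the title field), render the PDF from the same values, and store the bytes from toXml alongside or inside it. On load, fromXml gives you the fields back without having to parse the PDF page content at all. Note that Properties does not preserve insertion order, so use stable field names as keys rather than relying on position.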
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.

Print xml in pdf using itext

I want to print XML in a PDF using iText in Java, well formatted and displayed with colour and indentation, as shown in Notepad++.
Any API or suggestion regarding this?
I have converted XHTML to pdf, via iText, using flying saucer for the rendering (previously xhtml renderer).
http://code.google.com/p/flying-saucer/
You can format using CSS. I do remember it being slightly temperamental, but you can tweak it to get what you want and end up with something nicely formatted.
I wasn't sure what you meant regarding Notepad++. I don't have PDF support there; a PDF just opens as binary file contents, unless there is a PDF plugin you use?
::Answer updated after comments below.
Thanks for the comment, I understand the question much better now. I thought you wanted to output the data in the XML in the PDF, now I understand you want to see the raw XML itself in the PDF, formatted as you'd see XML formatted in Notepad, colours and all.
XML is a markup language designed to describe data, so you want to get this into a language that can describe the presentation and style as well as the data. I'd suggest:
1) Convert the XML to XHTML, so that all the XML (tags, attributes) is your content, and you have classes describing each type (for example, attribute names, attribute values, start tag, end tag). I don't know if you can use an XSLT library to transform it this way; otherwise you can write something yourself in Java, walking through the DOM and outputting it the way you want.
2) Create CSS to style your classes as you want, e.g. give all attribute names the text color "red"
3) Use iText and flying saucer as above to convert the XHTML and CSS into PDF using Java, as described in original answer
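Step 1 could look roughly like the following sketch, which walks the DOM with the JDK's own parser and emits the XML source as escaped text wrapped in span elements. The class name XmlHighlighter and the CSS class names (tag, attr-name, attr-value, text) are made up for the example, and comments, processing instructions and CDATA sections are simply skipped. Step 3 (rendering the XHTML to PDF) is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XmlHighlighter {
    // Parses the XML and returns a <pre> fragment showing its source with one span per token.
    static String toXhtml(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            StringBuilder out = new StringBuilder("<pre class=\"xml\">");
            render(root, 0, out);
            return out.append("</pre>").toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void render(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                indent(depth, out);
                out.append(span("text", text)).append('\n');
            }
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // comments, processing instructions etc. are ignored in this sketch
        }
        indent(depth, out);
        out.append("&lt;").append(span("tag", node.getNodeName()));
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            out.append(' ').append(span("attr-name", attr.getNodeName()))
               .append("=\"").append(span("attr-value", attr.getNodeValue())).append('"');
        }
        out.append("&gt;\n");
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            render(child, depth + 1, out);
        }
        indent(depth, out);
        out.append("&lt;/").append(span("tag", node.getNodeName())).append("&gt;\n");
    }

    private static void indent(int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    // Wraps escaped text in a span carrying the given CSS class.
    private static String span(String cssClass, String text) {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;")
                             .replace(">", "&gt;").replace("\"", "&quot;");
        return "<span class=\"" + cssClass + "\">" + escaped + "</span>";
    }
}
```

For step 2, the matching CSS would then be rules such as .attr-name { color: red; } and .tag { color: blue; }. Wrap the returned fragment in a complete XHTML document that includes this stylesheet before handing it to the renderer.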
