I want to print XML in a PDF using iText in Java, well formatted and displayed with colors and indentation, like it's shown in Notepad++.
Any API or suggestions regarding this?
I have converted XHTML to PDF via iText, using Flying Saucer for the rendering (previously known as xhtmlrenderer).
http://code.google.com/p/flying-saucer/
You can format using CSS, though I do remember it's slightly temperamental; however, you can tweak it to get what you want and end up with something nicely formatted.
I wasn't sure what you meant regarding Notepad++ - I don't have PDF support there, it just opens PDFs as binary file contents, unless there is a PDF plugin you use?
::Answer updated after comments below.
Thanks for the comment, I understand the question much better now. I thought you wanted to output the data from the XML in the PDF; now I understand you want to see the raw XML itself in the PDF, formatted as you'd see XML formatted in Notepad++, colours and all.
XML is a markup language designed to describe data, so you want to get this into a language that can describe the presentation and style as well as the data. I'd suggest:
1) Convert the XML to XHTML - so all the XML (tags, attributes) is your content, and you have classes describing each type (for example, attribute names, attribute values, start tags, end tags). I don't know if you can use an XSLT library to transform it this way; otherwise you can write something yourself in Java, walking through the DOM and outputting it the way you want (see the sketch after this list).
2) Create CSS to style your classes as you want - e.g. have all attribute names as text color "red"
3) Use iText and Flying Saucer as above to convert the XHTML and CSS into PDF using Java, as described in the original answer.
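As a rough sketch of steps 1 and 3 combined (the class names "tag", "attr-name" and "attr-value" and the file names are my own inventions for illustration; escaping of text content and attribute values is deliberately simplified, and String.repeat needs Java 11+):

    // Sketch only: walk the XML DOM, emit class-annotated XHTML, and render
    // it to PDF with Flying Saucer. Escaping is simplified for brevity.
    import org.w3c.dom.*;
    import org.xhtmlrenderer.pdf.ITextRenderer;
    import javax.xml.parsers.DocumentBuilderFactory;
    import java.io.*;

    public class XmlToPdfHighlighter {

        public static void main(String[] args) throws Exception {
            Document xml = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("input.xml"));

            StringBuilder body = new StringBuilder();
            walk(xml.getDocumentElement(), 0, body);

            // Step 2: the CSS that colours each token class.
            String xhtml = "<html><head><style>"
                    + ".tag{color:blue} .attr-name{color:red} "
                    + ".attr-value{color:purple} pre{font-family:monospace}"
                    + "</style></head><body><pre>" + body + "</pre></body></html>";

            // Step 3: render the XHTML+CSS to PDF via Flying Saucer / iText.
            ITextRenderer renderer = new ITextRenderer();
            renderer.setDocumentFromString(xhtml);
            renderer.layout();
            try (OutputStream os = new FileOutputStream("output.pdf")) {
                renderer.createPDF(os);
            }
        }

        // Step 1: one indented, span-wrapped line per element.
        private static void walk(Element e, int depth, StringBuilder out) {
            String indent = "  ".repeat(depth);
            out.append(indent).append("<span class='tag'>&lt;")
               .append(e.getTagName()).append("</span>");
            NamedNodeMap attrs = e.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr a = (Attr) attrs.item(i);
                out.append(" <span class='attr-name'>").append(a.getName())
                   .append("</span>=<span class='attr-value'>&quot;")
                   .append(a.getValue()).append("&quot;</span>");
            }
            out.append("<span class='tag'>&gt;</span>\n");
            NodeList kids = e.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                Node n = kids.item(i);
                if (n instanceof Element) walk((Element) n, depth + 1, out);
                else if (n instanceof Text && !n.getTextContent().trim().isEmpty())
                    out.append(indent).append("  ")
                       .append(n.getTextContent().trim()).append("\n");
            }
            out.append(indent).append("<span class='tag'>&lt;/")
               .append(e.getTagName()).append("&gt;</span>\n");
        }
    }

The walk method gives you full control over indentation and over which token classes you emit, which is exactly what the CSS in step 2 then hooks into.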
I have used iText in Java to convert HTML to PDF.
Now I want to test whether the PDF I generated is correct, i.e. that the contents are all present and at the correct positions.
Is there a way to do this testing of my code?
Basically, your question is about validating iText output.
If you do not trust the library converting HTML to PDF, you probably do not trust reading raw PDF data either. You can therefore use another library (PDF Clown) to parse the PDF as validation.
You have 2 approaches.
The first one requires rasterization of the PDF (e.g. with Ghostscript) and a comparison of the result against the HTML. Indeed, the performance overhead is significant.
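As an illustration of the rasterization step (not from my previous answer; the gs flags are standard Ghostscript options, and I assume gs is on the PATH):

    // Rasterize each page of a PDF to a PNG with Ghostscript so the pages
    // can be compared pixel-wise against reference renderings.
    import java.io.File;

    public class PdfRasterizer {
        public static void rasterize(File pdf, File outDir) throws Exception {
            Process gs = new ProcessBuilder(
                    "gs", "-dNOPAUSE", "-dBATCH", "-sDEVICE=png16m", "-r150",
                    "-sOutputFile=" + new File(outDir, "page-%03d.png"),
                    pdf.getAbsolutePath())
                    .inheritIO().start();
            if (gs.waitFor() != 0)
                throw new IllegalStateException("Ghostscript failed");
        }
    }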
The second one parses the document format. I went into depth in my previous answer about searching for text inside a PDF file.
I mentioned there searching for text as well as finding its position on the page.
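A minimal sketch of such a text-based check, here using Apache PDFBox rather than PDF Clown (either works; I just have the PDFBox 2.x call at hand):

    // Extract the text of the generated PDF and assert the expected content
    // made it in. Uses the PDFBox 2.x API (PDDocument.load, PDFTextStripper).
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import java.io.File;

    public class PdfContentCheck {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("generated.pdf"))) {
                String text = new PDFTextStripper().getText(doc);
                if (!text.contains("Hello World")) // the content you expect
                    throw new AssertionError("expected text missing from PDF");
            }
        }
    }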
I would suggest simply avoiding validation of the output unless you know something is wrong.
These libraries are widely used and well tested.
All of the guides out there tell me how to remove the HTML tags from the text to extract the text between them. What I am after is extracting the data that is within the HTML tags themselves.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware of trying to parse HTML as "standard" XML, as XML parsing is strict by nature and will fail if the page does not conform to the XML markup spec (which few HTML pages do).
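For the exact <FONT> fragment from the question, a jsoup snippet along these lines pulls out the attribute (note that jsoup normalizes tag and attribute names to lower case):

    // Parse the fragment leniently with jsoup and read the SIZE attribute.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FontSizeExtractor {
        public static void main(String[] args) {
            String html = "<FONT SIZE=\"5\">Hello World</FONT>";
            Document doc = Jsoup.parse(html);
            Element font = doc.select("font").first(); // jsoup lower-cases tags
            String size = font.attr("size");           // "5"
            String text = font.text();                 // "Hello World"
            System.out.println("size=" + size + ", text=" + text);
        }
    }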
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like Jericho HTML which enables you to search for HTML tags as well as their attributes, or you can build some DOM on your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
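If the input really is well-formed XML (which, as noted above, most HTML is not), the plain JAXP route looks roughly like this:

    // Strict JAXP/DOM parsing - this only works when the input is
    // well-formed XML, which the example fragment happens to be.
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import javax.xml.parsers.DocumentBuilderFactory;
    import java.io.ByteArrayInputStream;

    public class JaxpAttributeExtractor {
        public static void main(String[] args) throws Exception {
            String xml = "<FONT SIZE=\"5\">Hello World</FONT>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            Element font = doc.getDocumentElement();
            System.out.println(font.getAttribute("SIZE")); // prints "5"
        }
    }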
I want to extract text from crawled HTML web pages. I am using the excellent open-source Boilerpipe library to do just that. However, with Boilerpipe I am getting only the raw text. In addition to the raw text, I need to capture the text with its original source formatting information, with all CSS styling info inlined.
Is there a way to do this with Boilerpipe or any other Java library, preferably open source?
I should start by saying that I've never used Boilerpipe ... or even heard of it until now.
But looking at the website and the javadocs, I'd say that you can't use it to extract text with styling. The basic conceptual problem is how that styling would / could be represented. For example, the BoilerpipeExtractor interface has 4 getText methods, and each of those methods returns the extracted text as a String. How would you represent styling in a String? You'd have to embed some kind of markup, but ...
- what kind of markup, and
- how would you reconcile this with the description of the interface, which says that the methods return "text" ... not "text with markup"?
So, my assessment is that using Boilerpipe to extract text with styling is a complete non-starter. Go with the other alternatives you've already identified.
I have an XML file which I want to print as a PDF using PCL. I am new to PCL. Can I use PCL to get the XML printed in PDF format directly, or should I have some intermediate process to create a PDF file and then use PCL to get it printed?
If you have an XML file, there are two ways to produce a PDF file.
1. Create a stylesheet for your XML and use XEP,
or
2. use just your XML and VisualXSL, which will help you create your PDF for print.
Additionally: if you create your own XSL stylesheet, XEP can produce many types of PDF, for example PDF/A-1 or other conformance levels.
Both XEP and VisualXSL are RenderX products (http://www.renderx.com/tools/index.html) and they have trial versions that you can use :). I have used both products many times and was satisfied.
You can also visit the forum, where you can find answers about how to use the products described above and how useful they are: http://cooltools.renderx.com
PCL is a printer control language, in other words command bytes you send to a (usually HP) printer, which are then converted to ink on a page. This is normally not the way you will generate a PDF, since too much information from the original would be lost.
You will normally want to convert your XML to something describing the actual print you want to have. A reasonable choice for this is the XSL-FO XML dialect which, however, is not very nice to write by hand. You can instead choose to convert your XML into DocBook XML, which in turn has very nice stylesheets for converting further on to XSL-FO and other formats.
You can then use Apache FOP to convert XSL-FO into many formats, one being PDF. This also allows you - if FOP proves too limited - to swap in one of several commercial XSL-FO rendering engines at a later date.
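A minimal sketch of the FOP step (the standard FOP 2.x embedding pattern; the file names are placeholders):

    // Transform input.xml with an XSLT that produces XSL-FO, feeding the
    // result straight into FOP's SAX handler to get a PDF.
    import org.apache.fop.apps.Fop;
    import org.apache.fop.apps.FopFactory;
    import org.apache.fop.apps.MimeConstants;
    import javax.xml.transform.*;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.*;

    public class XmlToPdf {
        public static void main(String[] args) throws Exception {
            FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream("out.pdf"))) {
                Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
                Transformer t = TransformerFactory.newInstance()
                        .newTransformer(new StreamSource(new File("xml2fo.xsl")));
                t.transform(new StreamSource(new File("input.xml")),
                            new SAXResult(fop.getDefaultHandler()));
            }
        }
    }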
In a current project I need to display PDFs in a webpage. Right now we are embedding them with the Adobe PDF Reader, but I would rather have something more elegant (the reader does not integrate well, it cannot be overlaid with transparent regions, ...).
I envision something close to Google Docs, where PDFs are displayed as images but text can still be selected and copied out of the PDF (a requirement we have).
Does anybody know how they do this? Or of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on the server side, but this would not allow for the selection of text ...
Thanks in advance for any help
PS: Java-based project, using Wicket.
I have some suggestions, but it'll definitely be hard to implement this stuff. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into an image. Store these images on your server or use a caching technique. Converting PDF into an image is not hard (see the sketch below).
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay selectable text on the image. The overlay text is selectable, while the real text is still in the original image. To get the original text, see the next approach, or do it yourself; see the conclusion.
The original text then must be overlaid on the image, which is a pain.
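For the PDF-to-image step, pdf-renderer's usual pattern looks roughly like this (from memory, so treat the exact calls as an approximation):

    // Render page 1 of a PDF to a PNG with pdf-renderer (com.sun.pdfview).
    // The x2 scale factor is arbitrary; pages are 1-based in this library.
    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;
    import javax.imageio.ImageIO;
    import java.awt.Image;
    import java.awt.Rectangle;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class PdfPageToImage {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("in.pdf", "r");
                 FileChannel channel = raf.getChannel()) {
                ByteBuffer buf = channel.map(
                        FileChannel.MapMode.READ_ONLY, 0, channel.size());
                PDFFile pdf = new PDFFile(buf);
                PDFPage page = pdf.getPage(1);
                Rectangle r = new Rectangle(0, 0,
                        (int) page.getBBox().getWidth(),
                        (int) page.getBBox().getHeight());
                int w = r.width * 2, h = r.height * 2;
                Image img = page.getImage(w, h, r, null, true, true); // wait=true
                BufferedImage out = new BufferedImage(w, h,
                        BufferedImage.TYPE_INT_RGB);
                out.getGraphics().drawImage(img, 0, 0, null);
                ImageIO.write(out, "png", new File("page1.png"));
            }
        }
    }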
Second approach:
The PDF specification allows textual information to be linked to a font. Most documents use a subset of Type 3 or Type 1 fonts which (often) use a standard character set (I thought it was Unicode, but I'm not sure). If your PDF document does not use a standard character set (i.e. it has defined its own), it's impossible to know which characters correspond to which glyphs (symbols), and thus you are unable to convert them to a textual representation.
Read the PDF document, read the graphics objects, and parse the instructions for rendering text (use the PDF specification for more insight into this process), converting them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts used (their names and attributes) and the instructions (letter spacing, line spacing, size, face) in the graphics objects.
You can use the pdf-renderer library for reading and parsing the PDF files and then code an HTML translator yourself. This is not easy, and it does not cover all cases of PDF documents.
With this approach you will lose the original look of the document. There are some PDF generation libraries which do not use the Adobe font techniques. This is also a problem with the first approach: even though you can see the text, you cannot select it (but that matches the behavior of the official Adobe Reader, so it's not a big deal, you might say).
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR), since it's really overkill for such a problem and it also has several drawbacks. This is the approach Google uses: if there are characters which go unrecognized, a human being does the processing.
If you are into the human-processing thing, you can use just the Type Select library and PDF-to-image conversion and do the "OCR" yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.