Is it possible to merge layers of a PDF (OCG) with the base PDF to result in a PDF without layers?
I saw that it's possible to accomplish this using an application such as Adobe Acrobat DC with its "Flatten Layers" option, but I need this done programmatically in my Java application using iText7.
EDIT:
@joelgeraci has a useful answer that solves the question above, but I initially have some hidden layers that will be displayed anyway once the OCProperties entry is removed from the catalog.
You don't actually need to "merge" the layers. All of the layer content is already part of the page content. Layers, or more properly Optional Content Groups, are sets of instructions that the viewer can either draw or not, depending on the settings; in viewers that don't support layers, they all show. To "flatten" the layers, you just need to modify the PDF so that the viewer doesn't think there is any optional content. The easiest way is to delete the OCProperties dictionary from the Catalog. Once you have the catalog object, call "remove", passing the name of the OCProperties dictionary.
catalog.remove(PdfName.OCProperties)
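A minimal iText 7 sketch (file names here are just placeholders): open the layered PDF, drop the OCProperties entry from the catalog dictionary, and write the result.

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;

public class FlattenLayers {
    public static void main(String[] args) throws Exception {
        PdfDocument pdf = new PdfDocument(new PdfReader("layered.pdf"), new PdfWriter("flattened.pdf"));
        // the catalog's underlying dictionary holds the /OCProperties entry that declares the layers
        pdf.getCatalog().getPdfObject().remove(PdfName.OCProperties);
        pdf.close();
    }
}

Note that, as the EDIT above points out, content that belonged to hidden layers becomes visible once the optional-content information is gone; keeping it hidden would require removing or covering that content separately.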
Related
I have a part in my code where I am programmatically filling out PDF forms using iText for Java based on user-entered data, and then I concatenate a number of such PDFs into one, again using iText.
The PDF forms that are getting merged can be (and usually are) different.
The resulting PDF is way too large - looking at it, 98% of the space is taken by fonts.
The way I understand it, what happens is that the individual PDF forms have different font subsets, so when I merge them I get a massive amount of duplicate glyphs, except that the subsets are not identical, so I can't get rid of them without merging the subsets.
The other problem is that the PDF forms themselves might not even contain subsets, but heavily packed fonts that have 2000+ glyphs, so even if I manage to leave only one instance of that font in the PDF, that still can be many megabytes. Hence it seems that I need to be able to 1) create and 2) merge existing font subsets.
The quirk is that I control neither the PDF forms (that are being filled out), nor their number, nor the order in which they are concatenated, so it is not possible to solve this by controlling what kind of fonts are embedded in them.
Adobe Acrobat can of course solve such a problem - it can create and also merge font subsets - but I need a programmatic, server-side solution. According to what I can find on Google, iText cannot do this. Is there another library that I could use (or anything else I can do)?
I have one PDF file that has an embedded form that is based on XFA (XML) forms. The first PDF has a table which holds a list of people. If that table overflows, the subsequent list of people are handled by an addendum page which is also a PDF (XFA based form). Is it possible to merge all XFA-based PDFs into one PDF using iText?
@BrunoLowagie Thanks for your response. Actually, I managed to get iText to concatenate PDF interactive forms to create a custom PDF packet. Let me explain how I did this.
From using Adobe Acrobat XI Pro, I learned that when the XFA PDF is loaded, I cannot edit the form if I go to Tools->Edit (it gives me the usual warning that this PDF was created by LiveCycle Designer), but when I go to Pages->Extract and select all pages to be extracted, the whole XFA-based PDF is extracted and converted over to an AcroForm-based PDF. So if I had 25 fields in the XFA-based PDF, it successfully converted all 25 fields into AcroForm fields. Somehow Adobe Acrobat had to determine the variable names based on the XML structure: the XPath //form1/Page1/variable1 was converted to the AcroForm field name form1[0].Page1[0].variable1[0]. All visible (editable form) fields were present and aligned (pixel-perfect) as usual.
If I had flattened the XFA PDF, I would again need to place the form fields back on each page, which would be tedious. By using Pages->Extract->All Pages, it converts everything for me (no flattening needed, since flattening also strips all fields; no XFA Worker library needed either).
However, my PDF packet is static and I would like to repeat the addendum page in case data overflows from the 2nd page. I know I could have modified the initial XFA to handle this overflow, but the client wants to use the exact page look for the addendum, with headers/footers intact.
I found that I can achieve this by extracting the addendum page separately via Adobe Acrobat Pro->Pages->Extract->(Select Addendum Page), then it was converted to a PDF form w/AcroForm intact.
I took the original PDF packet and attempted to concatenate the addendum PDF page. So for the moment, the main packet has AcroForm fields and the addendum PDF page also has AcroForm fields.
When I used PdfCopy or PdfConcatenate to do the concatenation, I noticed that I lost all form fields when I called form.getFields().
When I used the (deprecated) PdfCopyFields to do the concatenation, all AcroForm fields were intact. (Exactly what I needed!) I also tested PdfCopyFields where some fields were filled/saved, and it still worked and carried the prefilled values over. I looked at the reason why PdfCopyFields was marked for deprecation: we can either merge accessible files and lose the forms, or merge the forms and lose the accessibility (tagged PDFs). What if I don't care about tagged PDFs or accessibility?....and still need PdfCopyFields to carry over the form fields intact. So far I'm forced to keep using PdfCopyFields since it does exactly what I need for PDF concatenation with interactive form fields. Is it possible to update PdfCopy to have an option to copy over the fields if PdfCopyFields is going away?
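For reference, a minimal sketch of the (deprecated) iText 5 PdfCopyFields approach that keeps the AcroForm fields (file names are placeholders):

import java.io.FileOutputStream;
import com.itextpdf.text.pdf.PdfCopyFields;
import com.itextpdf.text.pdf.PdfReader;

public class MergeWithFields {
    public static void main(String[] args) throws Exception {
        PdfCopyFields copy = new PdfCopyFields(new FileOutputStream("packet-with-addendum.pdf"));
        copy.addDocument(new PdfReader("main-packet.pdf")); // AcroForm fields are preserved
        copy.addDocument(new PdfReader("addendum.pdf"));    // addendum fields are preserved too
        copy.close();
    }
}

Newer iText 5 releases also offer PdfCopy.setMergeFields(), which is intended as the replacement for PdfCopyFields and may answer the last question.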
I have created a program that should one day become a PDF editor
Its purpose will be saving the GUI's textual content to a PDF and loading it back from the PDF. The GUI resembles a text editor, but it only has certain fields (JTextAreas, actually).
It can look like this (this is only one page; it can have many more, and the upper and lower margins are cut out of the picture). It should actually resemble A4 in pixel size.
I have looked around a bit for PDF libraries and found that iText could suit my PDF-creation needs; however, if I understood it correctly, it retrieves text from a whole page as a single string, which won't work for me, because I will need to detect different fields/paragraphs/something to be able to load them back into the program.
Now, I'm a bit lazy, but I don't want to spend hours going through numerous PDF libraries just to find out that they won't work for me.
Instead, I'm asking someone with a bit more Java PDF handling experience to recommend me one according to my needs.
Or maybe recommend how to add invisible parts to the PDF which will help my program determine where exactly it is situated inside the PDF file...
Just to be clear (I formed my question wrong before): the only thing I need to put in my PDF is text, and that's all I need to be able to get out later. My program should be able to read PDFs which it created itself...
Also, because of the designated use of files created with this program, they need to be in the PDF format.
Short Answer: Use an intermediate format like JSON or XML.
Long Answer: You're using PDFs in a manner they weren't designed for. PDFs were not designed to store data; they were designed to present and format data in a portable form. Furthermore, a PDF is a very "heavy" way to store data. I suggest storing your data in another manner, perhaps in a format like JSON or XML.
The advantage now is that you are not tied to a specific output-format like PDF. This can come in handy later on if you decide that you want to export your data into another format (like a Word document, or an image) because you now have a common representation.
I found this link and another link that provides examples that show you how to store and read back metadata in your PDF. This might be what you're looking for, but again, I don't recommend it.
If you really insist on using PDF to store data, I suggest that you store the actual data in either XML or RDF and then attach that to the PDF file when you generate it. Then you can read the XML back for the data.
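As an illustration, a sketch of attaching an XML payload to a PDF as an embedded file with iText 5 (file names and the XML content are placeholders):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfFileSpecification;
import com.itextpdf.text.pdf.PdfWriter;

public class AttachDataXml {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("report.pdf"));
        document.open();
        document.add(new Paragraph("Human-readable content goes here."));
        // embed the raw data alongside the presentation so it can be read back verbatim later
        byte[] xml = "<data><field name=\"title\">Hello</field></data>".getBytes(StandardCharsets.UTF_8);
        PdfFileSpecification fs = PdfFileSpecification.fileEmbedded(writer, null, "data.xml", xml);
        writer.addFileAttachment("application data", fs);
        document.close();
    }
}

Reading the data back is then a matter of walking the document's EmbeddedFiles name tree and extracting the attached stream.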
Assuming that your application will only consume PDF files generated by the same application, there is one part of the PDF specification called Marked Content, that was introduced precisely for this purpose. Using Marked Content you can specify the structure of the text in your document (chapter, paragraph, etc).
Read Chapter 14 - Document Interchange of the PDF Reference Document for more details.
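A rough iText 5 sketch of writing a marked-content sequence (the tag name, property key, and coordinates are arbitrary placeholders); a matching reader would walk the page's content stream and use the BDC properties to map text runs back to your GUI fields:

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.Element;
import com.itextpdf.text.Phrase;
import com.itextpdf.text.pdf.ColumnText;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.PdfWriter;

public class MarkedContentSketch {
    public static void main(String[] args) throws Exception {
        Document document = new Document();
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream("marked.pdf"));
        document.open();
        PdfContentByte cb = writer.getDirectContent();
        // tag this run of text with an inline property dictionary naming the GUI field it came from
        PdfDictionary props = new PdfDictionary();
        props.put(new PdfName("FieldName"), new PdfString("titleArea"));
        cb.beginMarkedContentSequence(new PdfName("Paragraph"), props, true);
        ColumnText.showTextAligned(cb, Element.ALIGN_LEFT, new Phrase("Text from the title JTextArea"), 36, 750, 0);
        cb.endMarkedContentSequence();
        document.close();
    }
}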
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content, because many pages don't have an article, only links with short descriptions of the full texts (this is common in news portals), and I don't want to discard these short texts.
So if an API does this, i.e. gets the different textual parts/blocks, splitting each one up in some manner rather than returning everything as a single text (all in only one text is not useful), please report it.
The Question
I download some pages from random sites, and now I want to analyze the textual content of the page.
The problem is that a web page has a lot of content like menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the content of the page.
Taking this page as an example, I don't want the menus at the top nor the links in the footer.
Important: All pages are HTML and are pages from various different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, and consecutive words that look like a proper name (first letter capitalized).
The solutions can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not an external application (if this can be possible).
I tried a way parsing the HTML content described in this question : https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
There are also options to use a Reader, which opens up a large number of options.
You can also use boilerpipe to segment the text into blocks of full-text/non-full-text, instead of just returning one of them (essentially, boilerpipe segments first, then returns a String).
Assuming you have your HTML accessible from a java.io.Reader, just let boilerpipe segment the HTML and classify the segments for you:
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;

Reader reader = ...
InputSource is = new InputSource(reader);
// parse the document into boilerpipe's internal data structure
TextDocument doc = new BoilerpipeSAXInput(is).getTextDocument();
// perform the extraction/classification process on "doc"
ArticleExtractor.INSTANCE.process(doc);
// iterate over all blocks (= segments as "ArticleExtractor" sees them)
for (TextBlock block : doc.getTextBlocks()) {
    // block.isContent() tells you if it's likely to be content or not
    // block.getText() gives you the block's text
}
TextBlock has some more exciting methods, feel free to play around!
There appears to be a possible problem with Boilerpipe. Why?
Well, it appears that it is suited to certain kinds of web pages, such as web pages that have a single body of content.
So one can crudely classify web pages into three kinds in respect to Boilerpipe:
a web page with a single article in it (Boilerpipe worthy!)
a web page with multiple articles in it, such as the front page of the New York Times
a web page that really doesn't have any article in it, but has some content in the form of links, and may also have some degree of clutter.
Boilerpipe works on case #1. But if one is doing a lot of automated text processing, then how does one's software "know" what kind of web page it is dealing with? If the web page itself could be classified into one of these three buckets, then Boilerpipe could be applied for case #1. Case #2 is a problem, and case #3 is a problem as well; it might require an aggregate of related web pages to determine what is clutter and what isn't.
You can use a library like goose. It works best on articles/news.
You can also check the readability bookmarklet, whose JavaScript code does extraction similar to goose.
My first instinct was to go with your initial method of using Jsoup. At least with that, you can use selectors and retrieve only the elements that you want (i.e. Elements posts = doc.select("p");) and not have to worry about the other elements with random content.
On the matter of your other post, was the issue of false positives your only reason for straying away from Jsoup? If so, couldn't you just tweak the value of MIN_WORDS_SEQUENCE or be more selective with your selectors (i.e. not retrieve div elements)?
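A minimal Jsoup sketch along those lines (the URL and the menu/banner selectors are assumptions about the pages being crawled):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractParagraphs {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/some-page.html").get();
        // drop the obvious non-content containers before extracting text
        doc.select(".menu, .banner, nav, header, footer").remove();
        Elements posts = doc.select("p");
        for (Element p : posts) {
            System.out.println(p.text());
        }
    }
}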
http://kapowsoftware.com/products/kapow-katalyst-platform/robo-server.php
Proprietary software, but it makes it very easy to extract from webpages and integrates well with java.
You use a provided application to design XML files that are read by the roboserver API to parse webpages. The XML files are built by analyzing, inside the provided application, the pages you wish to parse (fairly easy) and applying rules for gathering the data (generally, websites follow the same patterns). You can set up the scheduling, running, and database integration using the provided Java API.
If you're against using that software and want to do it yourself, I'd suggest not trying to apply one rule to all sites. Find a way to separate tags and then build rules per site.
You're looking for what are known as "HTML scrapers" or "screen scrapers". Here are a couple of links to some options for you:
Tag Soup
HTML Unit
You can filter out the HTML junk and then parse the required details, or use the APIs of the existing site.
Refer to the link below for filtering the HTML; I hope it helps.
http://thewiredguy.com/wordpress/index.php/2011/07/dont-have-an-apirip-dat-off-the-page/
You could use the textracto API; it extracts the main 'article' text, and there is also the option to extract all other textual content. By 'subtracting' these texts you could split the navigation texts, preview texts, etc. from the main textual content.
In a current project I need to display PDFs in a webpage. Right now we are embedding them with the Adobe PDF Reader, but I would rather have something more elegant (the reader does not integrate well, it cannot be overlaid with transparent regions, ...).
I envision something close to Google Docs, where PDFs are displayed as images but text can still be selected and copied out of the PDF (a requirement we have).
Does anybody know how they do this? Or of any library we could use to obtain a comparable result?
I know we could split the PDFs into images on the server side, but this would not allow for the selection of text ...
Thanks in advance for any help
PS: Java based project, using wicket.
I have some suggestions, but it'll definitely be hard to implement this stuff. Good luck!
First approach:
First, use a library like pdf-renderer (https://pdf-renderer.dev.java.net/) to convert the PDF into an image. Store these images on your server or use a caching technique. Converting a PDF into an image is not hard.
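A rough server-side sketch with pdf-renderer (file names are placeholders; I'm assuming page numbering is 1-based here, so check against your version):

import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import javax.imageio.ImageIO;
import com.sun.pdfview.PDFFile;
import com.sun.pdfview.PDFPage;

public class PdfToImage {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(new File("document.pdf"), "r");
        FileChannel channel = raf.getChannel();
        ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        PDFFile pdfFile = new PDFFile(buf);

        PDFPage page = pdfFile.getPage(1); // first page, assuming 1-based numbering
        Rectangle rect = new Rectangle(0, 0, (int) page.getBBox().getWidth(), (int) page.getBBox().getHeight());
        Image img = page.getImage(rect.width, rect.height, rect, null, true, true);

        // copy onto a BufferedImage so it can be written out and cached on the server
        BufferedImage out = new BufferedImage(rect.width, rect.height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(img, 0, 0, null);
        g.dispose();
        ImageIO.write(out, "png", new File("page-1.png"));
        raf.close();
    }
}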
Then, use the Type Select JavaScript library (http://www.typeselect.org/) to overlay the textual data over your image. That overlay text is selectable, while the real text is still in the original image. To get the original text, see the next approach, or do it yourself; see the conclusion.
The original text then must be overlaid on the image, which is a pain.
Second approach:
The PDF specification allows textual information to be linked to a font. Most documents use a subset of Type-3 or Type-1 fonts which (often) use a standard character set (I thought it was Unicode, but I am not sure). If your PDF document does not contain a standard character set (i.e. it has defined its own), it's impossible to know which characters correspond to which glyphs (symbols), and thus you are unable to convert them to a textual representation.
Read the PDF document, read the graphics objects, and parse the instructions for rendering text (use the PDF specification for more insight into this process), converting them to HTML. The HTML conversion can select appropriate tags (like <H1> and <p>, but also <b> and <i>) based on the parameters of the fonts used (their names and attributes) and the instructions in the graphics objects (letter spacing, line spacing, size, face).
You can use the pdf-renderer library for reading and parsing the PDF files and then code an HTML translator yourself. This is not easy, and it does not cover all cases of PDF documents.
In this approach you will lose the original look of the document. There are also some PDF generation libraries which do not use the Adobe font techniques. This is also a problem with the first approach: even though you can see the text, you cannot select it (but that is the same behavior as the official Adobe Reader, so not a big deal, you might say).
Conclusion:
You can choose the first approach, the second approach or both.
I wouldn't go in the direction of Optical Character Recognition (OCR), since it's really overkill for such a problem and has several drawbacks. This is the approach Google uses; if there are characters which are unrecognized, a human being does the processing.
If you are into the human-processing thing, you can just use the Type Select library and PDF-to-image conversion and do the OCR yourself, which is probably the easiest (human as a machine = intelligently cheap, lol) way to solve the problem.