EDITED
I'm using some code to extract fields and data of a signed PDF. It's a signed PDF with several forms with some fields and their values.
But I'm only getting the signature field/value with the current code:
/**
* This will print all the fields from the document.
*
* #param pdfDocument The PDF to get the fields from.
*
* #throws IOException If there is an error getting the fields.
*/
public void printFields(PDDocument pdfDocument) throws IOException
{
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List<PDField> fields = acroForm.getFields();
System.out.println(fields.size() + " top-level fields were found on the form");
for (PDField field : fields)
{
processField(field, "|--", field.getPartialName());
}
}
private void processField(PDField field, String sLevel, String sParent) throws IOException
{
String partialName = field.getPartialName();
if (field instanceof PDNonTerminalField)
{
if (!sParent.equals(field.getPartialName()))
{
if (partialName != null)
{
sParent = sParent + "." + partialName;
}
}
System.out.println(sLevel + sParent);
for (PDField child : ((PDNonTerminalField)field).getChildren())
{
processField(child, "| " + sLevel, sParent);
}
}
else
{
String fieldValue = field.getValueAsString();
StringBuilder outputString = new StringBuilder(sLevel);
outputString.append(sParent);
if (partialName != null)
{
outputString.append(".").append(partialName);
}
outputString.append(" = ").append(fieldValue);
outputString.append(", type=").append(field.getClass().getName());
System.out.println(outputString);
}
}
But getting
1 top-level fields were found on the form
|--ENVELOPEID_EE81E3AFDC0143968B9D253C5DEA7A2B.ENVELOPEID_EE81E3AFDC0143968B9D253C5DEA7A2B = org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature#1c53fd30, type=org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField
How can I get the contents of the PDF?
Regards
Related
I'm trying to parse docx file that contains content control fields (that are added using window like this, reference image, mine is on another language)
I'm using library APACHE POI. I found this question on how to do it. I used the same code:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import java.util.List;
import java.util.ArrayList;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlCursor;
import javax.xml.namespace.QName;
public class ReadWordForm {
private static List<XWPFSDT> extractSDTsFromBody(XWPFDocument document) {
XWPFSDT sdt;
XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
List<XWPFSDT> allsdts = new ArrayList<XWPFSDT>();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (qnameSdt.equals(xmlcursor.getName())) {
if (xmlcursor.getObject() instanceof CTSdtRun) {
sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document);
//System.out.println("block: " + sdt);
allsdts.add(sdt);
} else if (xmlcursor.getObject() instanceof CTSdtBlock) {
sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document);
//System.out.println("inline: " + sdt);
allsdts.add(sdt);
}
}
}
}
return allsdts;
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordDataCollectingForm.docx"));
List<XWPFSDT> allsdts = extractSDTsFromBody(document);
for (XWPFSDT sdt : allsdts) {
//System.out.println(sdt);
String title = sdt.getTitle();
String content = sdt.getContent().getText();
if (!(title == null) && !(title.isEmpty())) {
System.out.println(title + ": " + content);
} else {
System.out.println("====sdt without title====");
}
}
document.close();
}
}
The problem is that this code doesn't see these fields in the my docx file until I open it in LibreOffice and re-save it. So if the file is from Windows being put into this code it doesn't see these content control fields. But if I re-save the file in the LibreOffice (using the same format) it starts to see these fields, even tho it loses some of the data (titles and tags of some fields). Can someone tell me what might be the reason of it, how do I fix that so it will see these fields? Or there's an easier way using docx4j maybe? Unfortunately there's not much info about how to do it using these 2 libs in the internet, at least I didn't find it.
Examle files are located on google disk. The first one doesn't work, the second one works (after it was opened in Libre and field was changed to one of the options).
According to your uploaded sample files your content controls are in a table. The code you had found only gets content controls from document body directly.
Tables are beastly things in Word as table cells may contain whole document bodies each. That's why content controls in table cells are strictly separated from content controls in main document body. Their ooxml class is CTSdtCell instead of CTSdtRun or CTSdtBlock and in apache poi their class is XWPFSDTCell instead of XWPFSDT.
If it is only about reading the content, then one could fall back to XWPFAbstractSDT which is the abstract parent class of XWPFSDTCell as well as of XWPFSDT. So following code should work:
private static List<XWPFAbstractSDT> extractSDTsFromBody(XWPFDocument document) {
XWPFAbstractSDT sdt;
XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
List<XWPFAbstractSDT> allsdts = new ArrayList<XWPFAbstractSDT>();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
if (xmlcursor.getObject() instanceof CTSdtRun) {
sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document);
//System.out.println("block: " + sdt);
allsdts.add(sdt);
} else if (xmlcursor.getObject() instanceof CTSdtBlock) {
sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document);
//System.out.println("inline: " + sdt);
allsdts.add(sdt);
} else if (xmlcursor.getObject() instanceof CTSdtCell) {
sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null);
//System.out.println("cell: " + sdt);
allsdts.add(sdt);
}
}
}
}
return allsdts;
}
But as you see in code line sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null), the XWPFSDTCell totaly lost its connection to table and tablerow.
There is not a proper method to get the XWPFSDTCell directly from a XWPFTable. So If one would need to get XWPFSDTCell connected to its table, then also parsing the XML is needed. This could look like so:
private static List<XWPFSDTCell> extractSDTsFromTableRow(XWPFTableRow row) {
XWPFSDTCell sdt;
XmlCursor xmlcursor = row.getCtRow().newCursor();
QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
QName qnameTr = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "tr", "w");
List<XWPFSDTCell> allsdts = new ArrayList<XWPFSDTCell>();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
if (xmlcursor.getObject() instanceof CTSdtCell) {
sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), row, row.getTable().getBody());
//System.out.println("cell: " + sdt);
allsdts.add(sdt);
}
}
} else if (tokentype.isEnd()) {
//we have to check whether we are at the end of the table row
xmlcursor.push();
xmlcursor.toParent();
if (qnameTr.equals(xmlcursor.getName())) {
break;
}
xmlcursor.pop();
}
}
return allsdts;
}
And called from document like so:
...
for (XWPFTable table : document.getTables()) {
for (XWPFTableRow row : table.getRows()) {
List<XWPFSDTCell> allTrsdts = extractSDTsFromTableRow(row);
for (XWPFSDTCell sdt : allTrsdts) {
//System.out.println(sdt);
String title = sdt.getTitle();
String content = sdt.getContent().getText();
if (!(title == null) && !(title.isEmpty())) {
System.out.println(title + ": " + content);
} else {
System.out.println("====sdt without title====");
System.out.println(content);
}
}
}
}
...
Using curren apache poi 5.2.0 it is possible getting the XWPFSDTCell from XWPFTableRow via XWPFTableRow.getTableICells. This gets al List of ICells which is an interface which XWPFSDTCell also implemets.
So following code will get all XWPFSDTCell from tables without the need of low level XML parsing:
...
for (XWPFTable table : document.getTables()) {
for (XWPFTableRow row : table.getRows()) {
for (ICell iCell : row.getTableICells()) {
if (iCell instanceof XWPFSDTCell) {
XWPFSDTCell sdt = (XWPFSDTCell)iCell;
//System.out.println(sdt);
String title = sdt.getTitle();
String content = sdt.getContent().getText();
if (!(title == null) && !(title.isEmpty())) {
System.out.println(title + ": " + content);
} else {
System.out.println("====sdt without title====");
System.out.println(content);
}
}
}
}
}
...
I am using Apache PDFBox 2.0.8 and trying to remove one field. But can not find the way to do it, like I can do with iText: PdfStamper.getAcroFields().removeField("signature3").
What I am tying to do. Initially I have template PDF with 3 Digital Signatures. In some cases I need just 2 signatures, so it this case I need to remove 3rd signature from the template. And seems like I can't do it with PDFBox, close thing I found is flattening this field, but that problem is if a flatten particular PDField (not whole form, but just one field) - all other signatures are loosing their functionality, looks like they are getting flattened as well.
Here is code that does it:
PDDocument document = PDDocument.load(file);
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = documentCatalog.getAcroForm();
List<PDField> flattenList = new ArrayList<>();
for (PDField field : acroForm.getFieldTree()) {
if (field instanceof PDSignatureField && "signature3".equals(field.getFullyQualifiedName())) {
flattenList.add(field);
}
}
acroForm.flatten(flattenList, true);
document.save(dest);
document.close();
As Tilman already mentioned in a comment, PDFBox doesn't have a method to remove a field from the field tree. Nonetheless it has methods to manipulate the underlying PDF structure, so one can write such a method oneself, e.g. like this:
PDField removeField(PDDocument document, String fullFieldName) throws IOException {
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = documentCatalog.getAcroForm();
if (acroForm == null) {
System.out.println("No form defined.");
return null;
}
PDField targetField = null;
for (PDField field : acroForm.getFieldTree()) {
if (fullFieldName.equals(field.getFullyQualifiedName())) {
targetField = field;
break;
}
}
if (targetField == null) {
System.out.println("Form does not contain field with given name.");
return null;
}
PDNonTerminalField parentField = targetField.getParent();
if (parentField != null) {
List<PDField> childFields = parentField.getChildren();
boolean removed = false;
for (PDField field : childFields)
{
if (field.getCOSObject().equals(targetField.getCOSObject())) {
removed = childFields.remove(field);
parentField.setChildren(childFields);
break;
}
}
if (!removed)
System.out.println("Inconsistent form definition: Parent field does not reference the target field.");
} else {
List<PDField> rootFields = acroForm.getFields();
boolean removed = false;
for (PDField field : rootFields)
{
if (field.getCOSObject().equals(targetField.getCOSObject())) {
removed = rootFields.remove(field);
break;
}
}
if (!removed)
System.out.println("Inconsistent form definition: Root fields do not include the target field.");
}
removeWidgets(targetField);
return targetField;
}
void removeWidgets(PDField targetField) throws IOException {
if (targetField instanceof PDTerminalField) {
List<PDAnnotationWidget> widgets = ((PDTerminalField)targetField).getWidgets();
for (PDAnnotationWidget widget : widgets) {
PDPage page = widget.getPage();
if (page != null) {
List<PDAnnotation> annotations = page.getAnnotations();
boolean removed = false;
for (PDAnnotation annotation : annotations) {
if (annotation.getCOSObject().equals(widget.getCOSObject()))
{
removed = annotations.remove(annotation);
break;
}
}
if (!removed)
System.out.println("Inconsistent annotation definition: Page annotations do not include the target widget.");
} else {
System.out.println("Widget annotation does not have an associated page; cannot remove widget.");
// TODO: In this case iterate all pages and try to find and remove widget in all of them
}
}
} else if (targetField instanceof PDNonTerminalField) {
List<PDField> childFields = ((PDNonTerminalField)targetField).getChildren();
for (PDField field : childFields)
removeWidgets(field);
} else {
System.out.println("Target field is neither terminal nor non-terminal; cannot remove widgets.");
}
}
(RemoveField helper methods removeField and removeWidgets)
One can apply this to a document and field like this:
PDDocument document = PDDocument.load(SOURCE_PDF);
PDField field = removeField(document, "Signature1");
Assert.assertNotNull("Field not found", field);
document.save(TARGET_PDF);
document.close();
(RemoveField test testRemoveInvisibleSignature)
PS: I am not sure how much form related information PDFBox actually caches somewhere. Thus, I would propose not to manipulate the form information any further in the same document manipulation session, at least not without tests.
PPS: You find a TODO in the removeWidgets helper method. If the method outputs "Widget annotation does not have an associated page; cannot remove widget", you'll have to add the missing code.
Thanks to #mkl, I managed to do so with a shorter implementation using version pdfbox-3.0.0-RC1. in this case, to hide a button (check):
var check = (PDPushButton) pdAcroForm.getField(name);
List<PDField> fields = pdAcroForm.getFields();
fields.removeIf(x -> x.getCOSObject().equals(check.getCOSObject()));
pdAcroForm.setFields(fields);
check.getWidgets().forEach(widget -> widget.setNoView(true));
I'm getting a error with ChatColor.translateAlternateColorCodes since I added a custom config to my plugin.
Here is the error:
Caused by: java.lang.NullPointerException
at org.bukkit.ChatColor.translateAlternateColorCodes(ChatColor.java:206)
~[spigot.jar:git-Spigot-1473]
at com.gmail.santiagoelheroe.LoginVip.<init>(LoginVip.java:44) ~[?:?]
The error says the problem is at line 44 inside LoginVip class.
YamlConfiguration configuracion = YamlConfiguration.loadConfiguration(configFile);
String textpermisos = configuracion.getString("Configuration.NoPermissionsMessage");
// Line 44
String permisos = (ChatColor.translateAlternateColorCodes('&', textpermisos));
String prefixtext = configuracion.getString("Configuration.Prefix");
String prefix = (ChatColor.translateAlternateColorCodes('&', prefixtext));
I have to fix this error to finish my first plugin.
Config.class:
import java.io.File;
import java.util.logging.Level;
import org.bukkit.configuration.file.YamlConfiguration;
public class Config {
public static File configFile = new File("Plugins/LoginVip/config.yml");
public static void load() {
YamlConfiguration spawn = YamlConfiguration.loadConfiguration(configFile);
}
public static void saveConfig() {
YamlConfiguration configuracion = new YamlConfiguration();
configuracion.set("Configuration.NoPermissionsMessage", "&cYou don't have permissions to do that");
try {
configuracion.save(configFile);
} catch (Exception e) {
LoginVip.log.log(Level.WARNING, "[LV] Error creating Config.yml file");
}
}
}
onEnable:
#Override
public void onEnable() {
log.log(Level.INFO, "[LV] Plugin loaded");
if(!Config.configFile.exists()) {
Config.saveConfig();
}
if(!Config.spawnFile.exists()) {
Config.saveSpawn();
}
Config.load();
}
textpermisos is null. From the Javadocs of MemoryConfiguration.getString(String):
Gets the requested String by path.
If the String does not exist but a default value has been specified, this will return the default value. If the String does not exist and no default value was specified, this will return null.
This means that your configuration file does not contain the key-value mapping for "Configuration.NoPermissionsMessage". It is null, which is then passed into ChatColor.translateAlternateColorCodes(char, String). Here is its source code, with a comment of mine indicating which line ChatColor.java:206 in your crash log was:
/*
* Translates a string using an alternate color code character into a
* string that uses the internal ChatColor.COLOR_CODE color code
* character. The alternate color code character will only be replaced if
* it is immediately followed by 0-9, A-F, a-f, K-O, k-o, R or r.
*
* #param altColorChar The alternate color code character to replace. Ex: &
* #param textToTranslate Text containing the alternate color code character.
* #return Text containing the ChatColor.COLOR_CODE color code character.
*/
public static String translateAlternateColorCodes(char altColorChar, String textToTranslate) {
char[] b = textToTranslate.toCharArray(); // textToTranslate is null, it causes a NPE to be thrown.
for (int i = 0; i < b.length - 1; i++) {
if (b[i] == altColorChar && "0123456789AaBbCcDdEeFfKkLlMmNnOoRr".indexOf(b[i+1]) > -1) {
b[i] = ChatColor.COLOR_CHAR;
b[i+1] = Character.toLowerCase(b[i+1]);
}
}
return new String(b);
}
To solve this:
Add default mapping so getString() would not return null but instead a default value. Here is one way to do this (consult the documentation for applying as HashMap):
YamlConfiguration configuracion = YamlConfiguration.loadConfiguration(configFile);
String defpermisos = "";
String textpermisos = configuracion.getString("Configuration.NoPermissionsMessage", defpermisos);
String permisos = ChatColor.translateAlternateColorCodes('&', textpermisos);
String defprefix = "";
String textprefix = configuracion.getString("Configuration.Prefix", defprefix);
String prefix = ChatColor.translateAlternateColorCodes('&', textprefix);
Modify your code to only translate color codes after a != null check.
YamlConfiguration configuracion = YamlConfiguration.loadConfiguration(configFile);
String textpermisos = configuracion.getString("Configuration.NoPermissionsMessage");
String permisos = null;
if (textpermisos != null)
permisos = ChatColor.translateAlternateColorCodes('&', textpermisos);
String prefixtext = configuracion.getString("Configuration.Prefix");
String prefix = null;
if (prefixtext != null)
prefix = ChatColor.translateAlternateColorCodes('&', prefixtext);
I'm using pdfbox for the first time. Now I'm reading something on the website Pdf
Summarizing I have a pdf like this:
only that my file has many and many different component(textField,RadionButton,CheckBox). For this pdf I have to read these values : Mauro,Rossi,MyCompany. For now I wrote the following code:
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
for(PDField pdField : pdAcroForm.getFields()){
System.out.println(pdField.getValue())
}
Is this a correct way to read the value inside the form component?
Any suggestion about this?
Where can I learn other things on pdfbox?
The code you have should work. If you are actually looking to do something with the values, you'll likely need to use some other methods. For example, you can get specific fields using pdAcroForm.getField(<fieldName>):
PDField firstNameField = pdAcroForm.getField("firstName");
PDField lastNameField = pdAcroForm.getField("lastName");
Note that PDField is just a base class. You can cast things to sub classes to get more interesting information from them. For example:
PDCheckbox fullTimeSalary = (PDCheckbox) pdAcroForm.getField("fullTimeSalary");
if(fullTimeSalary.isChecked()) {
log.debug("The person earns a full-time salary");
} else {
log.debug("The person does not earn a full-time salary");
}
As you suggest, you'll find more information at the apache pdfbox website.
The field can be a top-level field. So you need to loop until it is no longer a top-level field, then you can get the value. Code snippet below loops through all the fields and outputs the field names and values.
{
//from your original code
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
//get all fields in form
List<PDField> fields = acroForm.getFields();
System.out.println(fields.size() + " top-level fields were found on the form");
//inspect field values
for (PDField field : fields)
{
processField(field, "|--", field.getPartialName());
}
...
}
private void processField(PDField field, String sLevel, String sParent) throws IOException
{
String partialName = field.getPartialName();
if (field instanceof PDNonTerminalField)
{
if (!sParent.equals(field.getPartialName()))
{
if (partialName != null)
{
sParent = sParent + "." + partialName;
}
}
System.out.println(sLevel + sParent);
for (PDField child : ((PDNonTerminalField)field).getChildren())
{
processField(child, "| " + sLevel, sParent);
}
}
else
{
//field has no child. output the value
String fieldValue = field.getValueAsString();
StringBuilder outputString = new StringBuilder(sLevel);
outputString.append(sParent);
if (partialName != null)
{
outputString.append(".").append(partialName);
}
outputString.append(" = ").append(fieldValue);
outputString.append(", type=").append(field.getClass().getName());
System.out.println(outputString);
}
}
I use OpenCmis in-memory for testing. But when I create a document I am not allowed to set the versioningState to something else then versioningState.NONE.
The doc created is not versionable some way... I used the code from http://chemistry.apache.org/java/examples/example-create-update.html
The test method:
public void test() {
String filename = "test123";
Folder folder = this.session.getRootFolder();
// Create a doc
Map<String, Object> properties = new HashMap<String, Object>();
properties.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
properties.put(PropertyIds.NAME, filename);
String docText = "This is a sample document";
byte[] content = docText.getBytes();
InputStream stream = new ByteArrayInputStream(content);
ContentStream contentStream = this.session.getObjectFactory().createContentStream(filename, Long.valueOf(content.length), "text/plain", stream);
Document doc = folder.createDocument(
properties,
contentStream,
VersioningState.MAJOR);
}
The exception I get:
org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The versioning state flag is imcompatible to the type definition.
What am I missing?
I found the reason...
By executing the following code I discovered that the OBJECT_TYPE_ID 'cmis:document' don't allow versioning.
Code to view all available OBJECT_TYPE_ID's (source):
boolean includePropertyDefintions = true;
for (t in session.getTypeDescendants(
null, // start at the top of the tree
-1, // infinite depth recursion
includePropertyDefintions // include prop defs
)) {
printTypes(t, "");
}
static void printTypes(Tree tree, String tab) {
ObjectType objType = tree.getItem();
println(tab + "TYPE:" + objType.getDisplayName() +
" (" + objType.getDescription() + ")");
// Print some of the common attributes for this type
print(tab + " Id:" + objType.getId());
print(" Fileable:" + objType.isFileable());
print(" Queryable:" + objType.isQueryable());
if (objType instanceof DocumentType) {
print(" [DOC Attrs->] Versionable:" +
((DocumentType)objType).isVersionable());
print(" Content:" +
((DocumentType)objType).getContentStreamAllowed());
}
println(""); // end the line
for (t in tree.getChildren()) {
// there are more - call self for next level
printTypes(t, tab + " ");
}
}
This resulted in a list like this:
TYPE:CMIS Folder (Description of CMIS Folder Type) Id:cmis:folder
Fileable:true Queryable:true
TYPE:CMIS Document (Description of CMIS Document Type)
Id:cmis:document Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:My Type 1 Level 1 (Description of My Type 1 Level 1 Type)
Id:MyDocType1 Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:VersionedType (Description of VersionedType Type)
Id:VersionableType Fileable:true Queryable:true [DOC Attrs->]
Versionable:true Content:ALLOWED
As you can see the last OBJECT_TYPE_ID has versionable: true... and when I use that it does work.