How to extract bold text from pdf using pdfbox?

I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold. (A code suggestion would be welcome.)
Here is the code for extracting plain text from a PDF, which works fine:
PDDocument document = PDDocument
.load("/home/lipu/workspace/MRCPTester/test.pdf");
document.getClass();
if (document.isEncrypted()) {
try {
document.decrypt("");
} catch (InvalidPasswordException e) {
System.err.println("Error: Document is encrypted with a password.");
System.exit(1);
}
}
// PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// stripper.setSortByPosition(true);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(2);
stripper.setSortByPosition(true);
String st = stripper.getText(document);

The result of PDFTextStripper is plain text, so once the text has been extracted the formatting information is gone and it is too late. But you can override certain of its methods and only let through text that is formatted the way you want.
In case of the PDFTextStripper you have to override
protected void processTextPosition( TextPosition text )
In your override you check whether the text in question fulfills your requirements (TextPosition contains much information on the text in question, not only the text itself), and if it does, forward the TextPosition text to the super implementation.
The main problem is, though, to recognize which text is bold.
Criteria for boldness may be the word bold in the font name, e.g. Courier-BoldOblique. You access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method (in PDFBox 2.x the corresponding method is getName()):
String postscriptName = text.getFont().getBaseFont();
Criteria may also be from the font descriptor - you get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value
float fontWeight = text.getFont().getFontDescriptor().getFontWeight();
The value is defined as
(Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.
The specific interpretation of these values varies from font to font.
EXAMPLE 300 in one font may appear most similar to 500 in another.
(Table 122, Section 9.8.1, ISO 32000-1)
There may be additional hints of boldness to check, e.g. a large line width
double lineWidth = getGraphicsState().getLineWidth();
when the rendering mode draws an outline, too:
int renderingMode = getGraphicsState().getTextState().getRenderingMode();
You may have to try out, with the documents you have at hand, which criteria suffice.
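As a rough illustration of the first two criteria, the check below combines the font-name test with the font-weight test. It is only a sketch under assumptions: the class and method names are made up for this example, it takes the PostScript name and the weight as plain values rather than calling PDFBox itself, and it assumes a missing weight is passed in as 0.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: combines two of the boldness
// criteria discussed above (font name and font descriptor weight).
class BoldHeuristic {

    // ISO 32000-1 states that a FontWeight of 700 indicates bold.
    static final float BOLD_WEIGHT = 700f;

    // postscriptName: the font's PostScript name, may be null.
    // fontWeight: the font descriptor weight, assumed to be 0 when absent.
    static boolean looksBold(String postscriptName, float fontWeight) {
        boolean nameSaysBold = postscriptName != null
                && postscriptName.toLowerCase(Locale.ROOT).contains("bold");
        boolean weightSaysBold = fontWeight >= BOLD_WEIGHT;
        return nameSaysBold || weightSaysBold;
    }

    public static void main(String[] args) {
        System.out.println(looksBold("Courier-BoldOblique", 0f)); // name matches
        System.out.println(looksBold("Helvetica", 400f));         // neither matches
        System.out.println(looksBold("ABCDEF+SomeFont", 700f));   // weight matches
    }
}
```

Inside an overridden processTextPosition you would call such a check and forward the TextPosition to the super implementation only when it returns true. Fonts that simulate bold through the rendering mode and line width are not covered by this sketch.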

Related

How to render Colored Text in Apache PDFBox

Yes, it seems a weird question, but I was not able to render colored text in PDFBox.
Usually the code for generating text looks like this:
//create some document and page...
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDRectangle.A4);
//defined some font
PDFont helveticaRegular = PDType1Font.HELVETICA;
//content stream for writing the text
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.beginText();
contentStream.setFont(helveticaRegular, 16);
contentStream.setStrokingColor(1f,0.5f,0.2f);
contentStream.newLineAtOffset(64, page.getMediaBox().getUpperRightY() - 64);
contentStream.showText("The hopefully colored text");
contentStream.endText();
//closing the stream
contentStream.close();
[...] //code for saving and closing the document. Nothing special
Interestingly the setStrokingColor is the only method accepting colors on the stream.
So I thought this is the way to color something in PDFBox.
BUT: I'm not getting any color to the text. So I guess this is a method for other types of content.
Does anybody know how to get colored text in PDFBox?

You use
contentStream.setStrokingColor(1f,0.5f,0.2f);
But in PDFs, text is by default drawn not by stroking a path but by filling it. Thus, you should try
contentStream.setNonStrokingColor(1f,0.5f,0.2f);
instead.
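For background: in the page content stream the two colors are set by different operators. According to the PDF specification, RG sets the stroking color and rg sets the non-stroking (fill) color in DeviceRGB. The sketch below merely formats those two operator strings to make the difference visible; it does not use PDFBox, and the helper names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper that formats raw PDF color operators as text.
// It only illustrates the stroking / non-stroking distinction.
class ColorOperators {

    // "rg" sets the non-stroking (fill) color, which is what text uses by default.
    static String nonStrokingRgb(float r, float g, float b) {
        return String.format(Locale.ROOT, "%.2f %.2f %.2f rg", r, g, b);
    }

    // "RG" sets the stroking color, used for outlines and stroked paths.
    static String strokingRgb(float r, float g, float b) {
        return String.format(Locale.ROOT, "%.2f %.2f %.2f RG", r, g, b);
    }

    public static void main(String[] args) {
        System.out.println(nonStrokingRgb(1f, 0.5f, 0.2f)); // 1.00 0.50 0.20 rg
        System.out.println(strokingRgb(1f, 0.5f, 0.2f));    // 1.00 0.50 0.20 RG
    }
}
```

Because text in the default rendering mode is filled, only the color set with the lowercase operator affects how it looks.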

While parsing pdf with iText7 chars move on fixed interval (with Freeset font)

I'm trying to parse a PDF that I created with iText. The document has two paragraphs:
"Имя" - ("name" from Russian) - font: Helvetica, size: 20.
"Фамилия" - ("surname" from Russian) - font: Freeset (I downloaded it here), size: 10.
When I finish parsing I get "Имя" properly decoded and "Ôàìèëèÿ" instead of "Фамилия". These are the Unicode characters of "Фамилия" shifted down by 848 code points (decimal). For instance, instead of "Ф" (U+0424) I get "Ô" (U+00D4), and the difference between them is 848 (0x350 in hex).
I use this example to get the text from the PDF, but instead of filtering by font I filter by equality to one of the strings in the set ("Имя", "Фамилия").
I know that we are advised to store non-English characters as sequences of Unicode symbols, but I'm creating the PDF on the fly from incoming data, so I can't manually retype it as separate Unicode symbols (if you know how to do that on the fly, please share your approach).
Any ideas as to why this shift of characters happens and how to avoid it are welcome. Thank you in advance.
Here is the file I worked with.
Edit
I tried opening file in Acrobat Pro and everything is fine there. Acrobat also shows that all three fonts I've put in pdf are still in the document.
Here is the code I used to create pdf I'm processing:
private static void create() throws IOException {
PdfDocument pdf = new PdfDocument(new PdfReader(srcPdf), new PdfWriter(targetPdf));
PdfCanvas pdfCanvas = new PdfCanvas(pdf.getFirstPage());
PdfFont freeset = getPdfFont(freesetPath);
PdfFont helvetica = getPdfFont(helveticaPath);
PdfFont circe = getPdfFont(circePath);
pdfCanvas.beginText()
.setFontAndSize(helvetica, 15)
.setColor(Color.RED, true)
.moveText(50, 300)
.showText("Имя")
.setFontAndSize(freeset, 10)
.setColor(Color.GREEN, true)
.moveText(0, -30)
.showText("Фамилия")
.setFontAndSize(circe, 20)
.setColor(Color.BLUE, true)
.moveText(0, -30)
.showText("Должность")
.endText();
pdf.close();
}
private static PdfFont getPdfFont(String path) throws IOException {
InputStream fontInputStream = new FileInputStream(path);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buffer = new byte[2048];
int a;
while((a = fontInputStream.read(buffer, 0, buffer.length)) != -1) {
baos.write(buffer, 0, a);
}
baos.flush();
return PdfFontFactory.createFont(baos.toByteArray(),
PdfEncodings.IDENTITY_H, true);
}

iText 7 appears to have an issue with embedding the font in question. I don't know whether it's a bug in the font or in iText, though.
The "FreeSet" font is indeed embedded in the OP's sample document with a wrong ToUnicode map
...
6 beginbfrange
<009e> <009e> <00d4>
<00aa> <00aa> <00e0>
<00b2> <00b2> <00e8>
<00b5> <00b5> <00eb>
<00b6> <00b6> <00ec>
<00c9> <00c9> <00ff>
endbfrange
...
which maps the glyphs used for "Фамилия" to 00d4, 00e0, 00e8, 00eb, 00ec, and 00ff.
This in turn explains why both iText and Adobe Reader extract unexpected text.
The issue can be reproduced like this:
PdfFont arial = PdfFontFactory.createFont(BYTES_OF_ARIAL_FONT, PdfEncodings.IDENTITY_H, true);
PdfFont freeSet = PdfFontFactory.createFont(BYTES_OF_FREESET_FONT, PdfEncodings.IDENTITY_H, true);
try ( OutputStream result = new FileOutputStream("cyrillicTextFreeSet.pdf");
PdfWriter writer = new PdfWriter(result);
PdfDocument pdfDocument = new PdfDocument(writer);
Document doc = new Document(pdfDocument) ) {
doc.add(new Paragraph("Фамилия").setFont(arial));
doc.add(new Paragraph("Фамилия").setFont(freeSet));
}
(CreateCyrillicText test testCreateTextWithFreeSet)
The result looks ok:
When extracting / copying&pasting, though:
The embedded Arial subset has a proper ToUnicode map, the text in Arial is extracted as "Фамилия".
The embedded FreeSet subset has an incorrect ToUnicode map, the text in FreeSet is extracted as "Ôàìèëèÿ".
(Tested with the current iText 7.1.1-SNAPSHOT)
Apparently iText 7 does understand the FreeSet font program well enough to select the needed subset and reference the correct glyphs from the content but it has problems building an appropriate ToUnicode map. This is not a general problem, though, as the parallel test with Arial shows.
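The values in the faulty ToUnicode map differ from the expected Cyrillic code points by the constant 0x350 (848) that the question mentions: U+00D4 + 0x350 = U+0424, and so on. If regenerating the PDF is not an option, text extracted from the affected font could be shifted back afterwards. The following is only a workaround sketch with a made-up class name; it assumes that every character in the range U+00C0 to U+00FF in the input really is a mis-mapped Cyrillic letter, so it must not be applied to text from fonts that extract correctly.

```java
// Hypothetical post-processing step, not an iText API: shifts characters
// from the Latin-1 range U+00C0..U+00FF up to the Cyrillic range
// U+0410..U+044F by the constant offset observed in the question.
class CyrillicShift {

    static final int OFFSET = 0x350; // 848 decimal

    static String shiftToCyrillic(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (c >= 0x00C0 && c <= 0x00FF) {
                sb.append((char) (c + OFFSET)); // mis-mapped letter, shift it back
            } else {
                sb.append(c);                   // leave everything else untouched
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The mis-extracted string from the question, written with escapes.
        String wrong = "\u00D4\u00E0\u00EC\u00E8\u00EB\u00E8\u00FF";
        System.out.println(shiftToCyrillic(wrong));
    }
}
```

The range covers А to я only; letters outside it, such as Ё and ё, would need separate handling.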

Using fonts in custom signatureAppearance signatures with iText7 breaks PDF/A conformance?

I am trying to create signed PDFs from PDF/A-1A input files, the output must preserve the conformance level.
The signatures have to be added with a customized appearance.
If I do it along the lines of the code below, everything works on the signature side: the signature is displayed correctly and verifies OK.
But the PDF/A conformance is broken by the embedded fonts, which do not contain the required ToUnicode CMaps.
PdfADocument pdf = ... the doc to be signed
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
PdfReader reader = pdf.getReader();
PrivateKey privateKey = ...
Provider signatureProvider = new BouncyCastleProvider();
Certificate[] signChain = ...
PdfSigner pdfSigner = new PdfSigner(reader, buffer, true);
PdfSignatureAppearance signatureAppearance = pdfSigner.getSignatureAppearance();
signatureAppearance.setReuseAppearance(false);
signatureAppearance.setPageNumber(pdf.getNumberOfPages());
pdfSigner.setFieldName("Custom Signature");
float margin = 35;
Rectangle pageSize = pdf.getLastPage().getMediaBox();
Rectangle signaturePosition = new Rectangle(pageSize.getLeft()+margin,
pageSize.getBottom()+margin,
pageSize.getWidth()-2*margin,
(pageSize.getHeight()-2*margin)/3);
// need to do this before creating any *Canvas object, else the pageRect will be null and the signature invisible
signatureAppearance.setPageRect(signaturePosition);
PdfFont regularFont = PdfFontFactory.createFont("/path/to/truetypefile-regular.ttf", "ISO-8859-1", true);
PdfFont boldFont = PdfFontFactory.createFont("/path/to/truetypefile-bold.ttf", "ISO-8859-1", true);
int fontSize = 10;
PdfFormXObject n0 = signatureAppearance.getLayer0();
PdfCanvas n0Canvas = new PdfCanvas(n0, pdfSigner.getDocument());
PdfFormXObject n2 = signatureAppearance.getLayer2();
Canvas n2Canvas = new Canvas(n2, pdfSigner.getDocument());
if(regularFont != null) {
n2Canvas.setFont(regularFont);
n0Canvas.setFontAndSize(regularFont, fontSize);
}
ImageData imageData = ImageDataFactory.create("/path/to/image.png");
Image image = new Image(imageData);
n2Canvas.add(image);
String layer2Text = ... some lines of text containing newlines and some simple markdown
String[] paragraphs = layer2Text.split("\n\n");
for (String text : paragraphs) {
boolean bold = false;
if(text.startsWith("[bold]")) {
bold = true;
text = text.replaceFirst("^\\s*\\[bold\\]\\s*", "");
}
Paragraph p = new Paragraph(text);
p.setFontSize(fontSize);
if(bold) {
p.setFont(boldFont);
}
n2Canvas.add(p);
}
... pdfSigner.setCertificationLevel(PdfSigner.CERTIFIED_FORM_FILLING_AND_ANNOTATIONS);
PrivateKeySignature externalSignature = new PrivateKeySignature(privateKey, DigestAlgorithms.SHA512, signatureProvider.getName());
BouncyCastleDigest externalDigest = new BouncyCastleDigest();
pdfSigner.signDetached(externalDigest, externalSignature, signChain, null, null, null, 0, PdfSigner.CryptoStandard.CMS);
So I assume there is something missing there. The fonts that get embedded do not conform to PDF/A because they lack the ToUnicode CMap key.
Another error from pdf-tools validator says:
"The value of the key Encoding is Difference but must be WinAnsiEncoding or MacRomanEncoding." Which seems to be the same problem.
The signature itself is OK and visible, with styling and image in place as they should be. It's just the fonts that seem not to be OK.

The trigger for the violation of the PDF/A compliance is the way the fonts are created here
PdfFont regularFont = PdfFontFactory.createFont("/path/to/truetypefile-regular.ttf", "ISO-8859-1", true);
PdfFont boldFont = PdfFontFactory.createFont("/path/to/truetypefile-bold.ttf", "ISO-8859-1", true);
or even more specifically the encoding parameter "ISO-8859-1" used therein.
The PDF/A-1 specification requires:
6.3.7 Character encodings
All non-symbolic TrueType fonts shall specify MacRomanEncoding or WinAnsiEncoding as the value of the
Encoding entry in the font dictionary. All symbolic TrueType fonts shall not specify an Encoding entry in the
font dictionary, and their font programs' “cmap” tables shall contain exactly one encoding.
The use of the encoding parameter "ISO-8859-1" caused neither MacRomanEncoding nor WinAnsiEncoding to be specified as the value of the
Encoding entry in the font dictionary. Instead, the value is a dictionary containing only a Differences entry with an explicit mapping.
Depending on the PDF/A validator this may result in different error messages.
For (as I assume) historic reasons there are a few different encoding parameter values during font creation which cause iText to use WinAnsiEncoding:
""
PdfEncodings.WINANSI (== "Cp1252")
"winansi" (case-insensitive)
"winansiencoding" (case-insensitive)
The OP used PdfName.WinAnsiEncoding.getValue() which returns a string matching the last option.
While this indicates that iText can be used to properly sign PDF/A documents, a specific PDFASigner class enforcing conformance probably should be introduced.
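To make the list of accepted encoding values easier to check against, here is a small sketch that tests whether a given encoding parameter is one of them. It only restates the list above and is not iText's own code; the class and method names are invented for this example.

```java
import java.util.Locale;

// Hypothetical check, not part of iText: mirrors the list of encoding
// parameter values that, per the answer above, select WinAnsiEncoding.
class WinAnsiNames {

    static boolean selectsWinAnsi(String encoding) {
        if (encoding == null) {
            return false;
        }
        // "" and PdfEncodings.WINANSI (== "Cp1252") are matched exactly.
        if (encoding.isEmpty() || encoding.equals("Cp1252")) {
            return true;
        }
        // "winansi" and "winansiencoding" are matched case-insensitively.
        String lower = encoding.toLowerCase(Locale.ROOT);
        return lower.equals("winansi") || lower.equals("winansiencoding");
    }

    public static void main(String[] args) {
        System.out.println(selectsWinAnsi("ISO-8859-1"));      // false
        System.out.println(selectsWinAnsi("WinAnsiEncoding")); // true
    }
}
```

In the signing code from the question, the practical consequence would be to pass one of these values, for example PdfEncodings.WINANSI, as the second argument of PdfFontFactory.createFont instead of "ISO-8859-1".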

How to set background color (Page Color) for word document (.doc or .docx) in Java?

With libraries such as Apache POI (http://poi.apache.org) we can create a Word document with any text color, but I didn't find any solution for the background or highlight of the text.
This is how the page color is set manually in Word:
https://support.office.com/en-us/article/Change-the-background-or-color-of-a-document-6ce0b23e-b833-4421-b8c3-b3d637e62524
Here is my main code for creating a Word document with Apache POI:
// Blank Document
@SuppressWarnings("resource")
XWPFDocument document = new XWPFDocument();
// Write the Document in file system
FileOutputStream out = new FileOutputStream(new File(file_address));
// create Paragraph
XWPFParagraph paragraph = document.createParagraph();
paragraph.setAlignment(ParagraphAlignment.RIGHT);
XWPFRun run = paragraph.createRun();
run.setFontFamily(font_name);
run.setFontSize(font_size);
// This only set text color not background!
run.setColor(hex_color);
for (String s : text_array) {
run.setText(s);
run.addCarriageReturn();
}
document.write(out);
out.close();

Update: XWPF is the newest way to create Word document files, but setting the background is only possible with HWPF, which is for the old format version (.doc).
For *.doc (i.e. POI's HWPF component):
Highlighting of text:
Look into setHighlighted()
Background color:
I suppose you mean the background of a paragraph (AFAIK, Word also allows coloring the entire page, which is a different matter).
There is setShading() which allows you to provide a foreground and background color (through setCvFore() and setCvBack() of SHDAbstractType) for a Paragraph. IIRC, it is the foreground that you would want to set in order to color your Paragraph. The background is only relevant for shadings which are composed of two (alternating) colors.
The underlying data structure is named Shd80 ([MS-DOC], 2.9.248). There is also SHDOperand ([MS-DOC], 2.9.249) that reflects the functionality of Word prior to Word97. [MS-DOC] is the Binary Word File format specification which is freely available on MSDN.
Edit:
Here is some code to illustrate the above:
try {
HWPFDocument document = [...]; // comes from somewhere
Range range = document.getRange();
// Background shading of a paragraph
ParagraphProperties pprops = new ParagraphProperties();
ShadingDescriptor shd = new ShadingDescriptor();
shd.setCvFore(Colorref.valueOfIco(0x07)); // yellow; ICO
shd.setIpat(0x0001); // solid background; IPAT
pprops.setShading(shd);
Paragraph p1 = range.insertBefore(pprops, StyleSheet.NIL_STYLE);
p1.insertBefore("shaded paragraph");
// Highlighting of individual characters
Paragraph p2 = range.insertBefore(new ParagraphProperties(), StyleSheet.NIL_STYLE);
CharacterRun cr = p2.insertBefore("highlighted text\r");
cr.setHighlighted((byte) 0x06); // red; ICO
document.write([...]); // document goes to somewhere
} catch (IOException e) {
e.printStackTrace();
}
ICO is a color structure
IPAT is a list of predefined shading styles

We only need to add these 3 lines to set the background color for Word documents with XWPF. They have to come after the XWPFRun has been declared and its text color set:
CTShd cTShd = run.getCTR().addNewRPr().addNewShd();
cTShd.setVal(STShd.CLEAR);
cTShd.setFill(hex_background_color);

Is it possible to mark a string in pdf document?

I was wondering if it was possible to mark strings in pdf with different color or underline them while looping through the pdf document ?

It's possible when creating a document. Just use different chunks to set the style. Here's an example:
Document document = new Document();
PdfWriter.getInstance(document, outputStream);
document.open();
document.add(new Chunk("This word is "));
Chunk underlined = new Chunk("underlined");
underlined.setUnderline(1.0f, -1.0f); //We can customize thickness and position of underline
document.add(underlined);
document.add(new Chunk(". And this phrase has "));
Chunk background = new Chunk("yellow background.");
background.setBackground(BaseColor.YELLOW);
document.add(background);
document.add(Chunk.NEWLINE);
document.close();
However, it's almost impossible to edit an existing PDF document. The author of iText writes in his book:
In a PDF document, every character or glyph on a PDF page has its
fixed position, regardless of the application that’s used to view the
document. This is an advantage, but it also comes with a disadvantage.
Suppose you want to replace the word “edit” with the word “manipulate”
in a sentence, you’d have to reflow the text. You’d have to reposition
all the characters that follow that word. Maybe you’d even have to
move a portion of the text to the next page. That’s not trivial, if
not impossible.
If you want to “edit” a PDF, it’s advised that you change the original
source of the document and remake the PDF.

Aspose.PDF APIs support creating new PDF documents and manipulating existing ones without a dependency on Adobe Acrobat. You can search for text and add a Highlight Annotation to mark it.
REST API Solution using Aspose.PDF Cloud SDK for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf-cloud/aspose-pdf-cloud-java
String name = "02_pages.pdf";
String folder="Temp";
String remotePath=folder+"/"+name;
// File to upload
File file = new File("C:/Temp/"+name);
// Storage name is default storage
String storage = null;
// Get App Key and App SID from https://dashboard.aspose.cloud/
PdfApi pdfApi = new PdfApi("xxxxxxxxxxxxxxxxxxxx", "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx");
//Upload file to cloud storage
pdfApi.uploadFile(remotePath,file,storage);
//Text Position
Rectangle rect = new Rectangle().LLX(259.27580539703365).LLY(743.4707997894287).URX(332.26148873138425).URY(765.5148007965088);
List<AnnotationFlags> flags = new ArrayList<>();
flags.add(AnnotationFlags.DEFAULT);
HighlightAnnotation annotation = new HighlightAnnotation();
annotation.setName("Name Updated");
annotation.rect(rect);
annotation.setFlags(flags);
annotation.setHorizontalAlignment(HorizontalAlignment.CENTER);
annotation.setRichText("Rich Text Updated");
annotation.setSubject("Subj Updated");
annotation.setPageIndex(1);
annotation.setZindex(1);
annotation.setTitle("Title Updated");
annotation.setModified("02/02/2018 00:00:00.000 AM");
List<HighlightAnnotation> annotations = new ArrayList<>();
annotations.add(annotation);
//Add Highlight Annotation to the PDF document
AsposeResponse response = pdfApi.postPageHighlightAnnotations(name,1, annotations, storage, folder);
//Download annotated PDF file from Cloud Storage
File downloadResponse = pdfApi.downloadFile(remotePath, null, null);
File dest = new File("C:/Temp/HighlightAnnotation.pdf");
Files.copy(downloadResponse.toPath(), dest.toPath(), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
System.out.println("Completed......");
On-Premise Solution using Aspose.PDF for Java:
// For complete examples and data files, please go to https://github.com/aspose-pdf/Aspose.Pdf-for-Java
// Instantiate Document object
Document document = new Document("C:/Temp/Test.pdf");
// Create TextFragment Absorber instance to search particular text fragment
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Estoque");
// Iterate through pages of PDF document
for (int i = 1; i <= document.getPages().size(); i++) {
// Get page i of the PDF document
Page page = document.getPages().get_Item(i);
page.accept(textFragmentAbsorber);
}
// Create a collection of Absorbed text
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// Iterate on above collection
for (int j = 1; j <= textFragmentCollection.size(); j++) {
TextFragment textFragment = textFragmentCollection.get_Item(j);
// Get rectangular dimensions of TextFragment object
Rectangle rect = new Rectangle((float) textFragment.getPosition().getXIndent(), (float) textFragment.getPosition().getYIndent(), (float) textFragment.getPosition().getXIndent() + (float) textFragment.getRectangle().getWidth(), (float) textFragment.getPosition().getYIndent() + (float) textFragment.getRectangle().getHeight());
// Instantiate HighLight Annotation instance
HighlightAnnotation highLight = new HighlightAnnotation(textFragment.getPage(), rect);
// Set opacity for annotation
highLight.setOpacity(.80);
// Set the color of annotation
highLight.setColor(Color.getYellow());
// Add annotation to annotations collection of TextFragment
textFragment.getPage().getAnnotations().add(highLight);
}
// Save updated document
document.save("C:/Temp/HighLight.pdf");
P.S.: I work as a support/evangelist developer at Aspose.
