how can I write java code with syntax highlighting using pdfbox - java

I am trying to create a Java project that simply prints a project's source code, be it Java, PHP, C++ or something else.
I can create the PDF just fine with iText, but now I need to highlight the Java code I read, the same way a code editor like Sublime does. I discovered PDFBox, a library for creating and manipulating PDF files, but I can't find how to highlight code text (like Sublime does) with this library. Any help?

Copying from another SO question : highlight text using pdfbox when it's location in the pdf is known
// PDFBox 1.8.x API (getAllPages() was removed in 2.x); i is the zero-based page index
PDDocument doc = PDDocument.load(/*path to the file*/);
PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(i);
List annots = page.getAnnotations();
PDAnnotationTextMarkup markup = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.Su....);
markup.setRectangle(/*your PDRectangle*/);
markup.setQuads(/*float array of size eight with all the vertices of the PDRectangle in anticlockwise order*/);
annots.add(markup);
doc.save(/*path to the output file*/);
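The float array passed to setQuads can be built from the rectangle's corners. Here is a minimal sketch, assuming an axis-aligned rectangle and the anticlockwise order described above, starting at the lower-left corner. QuadPoints and fromRectangle are hypothetical names, not part of PDFBox, and the sample coordinates are made up.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: builds the eight-value quad array
// for one axis-aligned rectangle given its lower-left and upper-right corners.
final class QuadPoints {

    // Returns the four corners as x,y pairs in anticlockwise order
    // (PDF user space, y grows upward), starting at the lower-left corner.
    static float[] fromRectangle(float llx, float lly, float urx, float ury) {
        return new float[] {
            llx, lly,   // lower left
            urx, lly,   // lower right
            urx, ury,   // upper right
            llx, ury    // upper left
        };
    }

    public static void main(String[] args) {
        // Made-up coordinates for illustration only.
        float[] quads = fromRectangle(40f, 680f, 550f, 780f);
        System.out.println(Arrays.toString(quads));
    }
}
```

The result would go straight into markup.setQuads(...). If a viewer draws the highlight oddly, the vertex order is the first thing I would experiment with.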

Related

Save as print to pdf option using Java 8

I am watermarking a PDF document using the iText 7 library. It preserves the layers but shows one of the signatures as invalid.
I want to flatten the created document.
When I save the document manually using the Adobe print option, it flattens all signatures and makes the document valid. I want the same functionality in a Java program.
Is there any way to flatten a PDF document from a Java program?
According to your tag selection you appear to be using iText 7 for Java.
How to flatten a PDF AcroForm form using iText 7 is explained in the iText 7 knowledge base example Flattening a form. The pivotal code is:
PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC), new PdfWriter(DEST));
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
// If no fields have been explicitly included, then all fields are flattened.
// Otherwise only the included fields are flattened.
form.flattenFields();
pdfDoc.close();
(https://kb.itextpdf.com/home/it7kb/examples/flattening-a-form visited 2021-05-24)

PDFBox 2.0.3 Set cropBox using TextPosition coordinates

I've located a region of interest in the page by tracking TextPosition objects using PDFTextStripper as shown in the example: https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java
As shown there, the coordinates have been retrieved from TextPosition methods like
text.getXDirAdj(), text.getWidthDirAdj(), text.getYDirAdj(), text.getHeightDir().
From this example I tried to keep everything else the same except setting the cropBox of the target page.
https://github.com/apache/pdfbox/blob/2.0.3/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java
OLD CROPBOX: [0.0,0.0,595.276,841.89] -> NEW CROPBOX [50.0,42.0,592.0,642.0].
So how can I use getYDirAdj and getXDirAdj to set the crop box correctly?
The original pdf file I'm processing can be downloaded from here: http://downloadcenter.samsung.com/content/UM/201504/20150407095631744/ENG-US_NMATSCJ-1.103-0330.pdf
Cropping the page
In a comment the OP reduced his problem to
Ok. Given a java PDRectangle rect = new PDRectangle(40f, 680f, 510f, 100f) obtained from TextLocation how would a java code snippet, that sets the cropBox of a single page look like ? Or how would you do it? TextLocation based rect --> some transformation --> setCropBox(theRightBox).
To set the crop box of the page twelve of the given document to the given PDRectangle you can use code like this:
PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));
pdDocument.save(new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.pdf"));
(SetCropBox.java test method testSetCropBoxENG_US_NMATSCJ_1_103_0330)
Adobe Reader now shows merely this part of page twelve:
Beware, though: the page in question does not only specify a media box (mandatory) and a crop box, it also defines a bleed box and an art box. Thus, applications that consider those boxes more interesting than the crop box might display the page differently. In particular, some applications might consider the art box (defined as "the extent of the page’s meaningful content") important.
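On the "TextLocation based rect --> some transformation" part: if the region was collected in the top-left-origin coordinates that getXDirAdj() and getYDirAdj() report, one possible transformation looks like the sketch below. It assumes an unrotated page whose media box starts at (0, 0), and that top is the upper edge of the region measured from the top of the page. CropBoxMath and toPdfRect are hypothetical names, not PDFBox API.

```java
// Hypothetical helper, not part of PDFBox.
final class CropBoxMath {

    // Converts a region given in top-left-origin coordinates (y grows downward)
    // into {lowerLeftX, lowerLeftY, width, height} in PDF user space
    // (origin at the lower-left corner, y grows upward).
    // Assumes an unrotated page whose media box starts at (0, 0).
    static float[] toPdfRect(float pageHeight, float left, float top,
                             float width, float height) {
        float lowerLeftY = pageHeight - (top + height);
        return new float[] { left, lowerLeftY, width, height };
    }

    public static void main(String[] args) {
        // Made-up region: 100 units high, starting 20 units below the top
        // of an 800-unit-high page.
        float[] r = toPdfRect(800f, 40f, 20f, 510f, 100f);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```

The four values map onto the PDRectangle(x, y, width, height) constructor used above. If the rectangle is already in PDF user space, no transformation is needed.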
Rendering the cropped page
In a comment to this answer the OP remarked
This is good and works. It correctly saves the page in the PDF file. I've tried to do the same in JPG and failed.
I reduced the OP's code to the essentials
PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));
PDFRenderer renderer = new PDFRenderer(pdDocument);
BufferedImage img = renderer.renderImage(12 - 1, 4f);
ImageIOUtil.writeImage(img, new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.jpg").getAbsolutePath(), 300);
pdDocument.close();
(SetCropBox.java test method testSetCropBoxImgENG_US_NMATSCJ_1_103_0330)
The result:
Thus, I cannot reproduce an issue here.
Possible details to check for:
ImageIOUtil is not part of the main PDFBox artifact, instead it is located in pdfbox-tools; does the version of that artifact match the version of the core pdfbox artifact?
I ran the code in an Oracle Java 8 environment; other Java environments might give rise to different results.
There are minor differences in our implementations. E.g. I load the PDF via an InputStream, you load it directly from the file system; I have hardcoded the page number, you have it in some variable, ... None of these differences should cause your problem, but who knows...
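One more thing worth checking is the expected image size. PDFRenderer.renderImage(pageIndex, scale) treats a scale of 1 as 72 DPI, so the cropped region should come out at roughly the crop box size times the scale. A rough sketch of that arithmetic follows; RenderSize and its methods are hypothetical names, and the rounding is my assumption about how the renderer sizes the image.

```java
// Hypothetical helper, not part of PDFBox: estimates rendered image dimensions.
final class RenderSize {

    // At scale 1, one PDF unit (1/72 inch) maps to one pixel, i.e. 72 DPI.
    static float scaleToDpi(float scale) {
        return scale * 72f;
    }

    // Approximate pixel length of a page dimension at the given scale
    // (rounded down, at least one pixel).
    static int pixels(float lengthInPoints, float scale) {
        return (int) Math.max(Math.floor(lengthInPoints * scale), 1);
    }

    public static void main(String[] args) {
        // Crop box of 510 x 100 units rendered with scale 4, as in the code above.
        System.out.println(pixels(510f, 4f) + " x " + pixels(100f, 4f)
                + " px at " + scaleToDpi(4f) + " DPI");
    }
}
```

Note that a scale of 4 corresponds to 288 DPI, while the 300 passed to ImageIOUtil.writeImage is only the resolution recorded in the image file. That small mismatch should not break anything, but an image with very different dimensions would suggest the crop box was not applied.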

pdf form fields empty after filling them with itext on android

In my Android app I fill the form fields of a PDF file using iTextG like this:
PdfReader reader = new PdfReader(this.templateFile);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(this.targetFile));
AcroFields form = stamper.getAcroFields();
for (String key : values.keySet()) {
form.setField(key, values.get(key));
}
stamper.setFormFlattening(true);
stamper.close();
I can see that the values of the form fields are actually set when debugging and inspecting the stamper. But as soon as I open the targetFile, all of my fields are empty.
If I do not flatten my form, the values remain in the fields, which makes me believe the values are also there in the flattened PDF but simply not displayed.
Btw, using FormFiller from the iText demos (http://itextpdf.com/itext-android-demos) the same PDF works really fine!
This could be caused by different things.
Not the correct iTextG version
See Appearance issues with pdf interactive forms using iText where you'll find this answer:
This seems to be a bug on some versions of iText. I had the same problem with iTextSharp version 5.5.5 and it was solved after I upgraded to version 5.5.9.
The form doesn't know it has to generate the appearances
See Editable .pdf fields disappear (but visible on field focus) after save with evince where the problem is solved by changing the appearance setting:
form.put(PdfName.NEEDAPPEARANCES, PdfBoolean.PDFTRUE);
Or see iText 5.5 fails to fill form where iText is instructed to create the appearances:
af.setGenerateAppearances(true);
I would try af.setGenerateAppearances(true); first (af is the AcroFields object, named form in the question's code).

How to detect OCR in a scanned Document with pdfbox 2.0.0?

The problem: I have a large folder with many subfolders with many PDFs in them. Some of them already have OCR on them, some of them don't. So I wanted to write a Java program to filter out the non-OCR PDFs and copy them to a hot folder.
I tested about 20 documents, and what they all have in common is that if you open them with an editor, you can find the word 'font' in the OCR ones and you can't find it in the non-OCR ones. My question now is: how do I implement this check using PDFBox 2.0.0? All the solutions I found seem to work only with older versions, and I'm not capable of finding a solution in the documentation (which is clearly my fault).
Thanks in advance.
Here's how to find out if fonts are on the top level of a page:
PDDocument doc = PDDocument.load(new File(...));
PDPage page = doc.getPage(0); // 0 based
PDResources resources = page.getResources();
for (COSName fontName : resources.getFontNames())
{
System.out.println(fontName.getName());
}
doc.close();
Re: mkl's suggestion, here's how to extract text (do this before calling doc.close()):
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1); // 1 based
stripper.setEndPage(1);
String extractedText = stripper.getText(doc);
System.out.println(extractedText);
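To turn the extracted text into a yes/no decision for the hot folder, something like the following sketch could work. TextLayerCheck, looksLikeTextLayer and the threshold are hypothetical; the right minimum depends on your documents.

```java
// Hypothetical helper, not part of PDFBox: decides from extracted text
// whether a page appears to carry a text (OCR) layer.
final class TextLayerCheck {

    // Counts letters and digits; whitespace and line breaks that a
    // text stripper may emit for an image-only page are ignored.
    static int countAlphanumerics(String extractedText) {
        int count = 0;
        for (int i = 0; i < extractedText.length(); i++) {
            if (Character.isLetterOrDigit(extractedText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // True if the text contains at least minChars letters or digits.
    static boolean looksLikeTextLayer(String extractedText, int minChars) {
        return extractedText != null
                && countAlphanumerics(extractedText) >= minChars;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTextLayer("\r\n  \n", 10));
        System.out.println(looksLikeTextLayer("Invoice 2021 total 42", 10));
    }
}
```

You would pass it the extractedText from above, e.g. looksLikeTextLayer(extractedText, 10), and copy the file to the hot folder when it returns false. Checking more than the first page is probably wise, since a scanned document may have a blank cover page.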

Wrong parsing with iText's PdfTextExtractor

I'm facing a problem when I try to read the content of a PDF document. I'm using iText 2.1.7 with Java, and I need to analyze the content of a PDF document. At first I was using PdfTextExtractor's getTextFromPage method and it was working right, but only when the page is just text. If the page contains an image, the String that I get from getTextFromPage is a set of meaningless symbols (maybe a different character encoding?), and I lose the content of the whole page. I tried the latest version of iText and it works fine, but if I'm not wrong its license wouldn't be totally free (I'm working on a web application for a commercial customer, which serves PDFs on the fly), so I can't use it. I would really appreciate any suggestions.
In case you need it, here is the code:
PdfReader pdf = new PdfReader(doc); // doc is just a byte[]
int pageCount = pdf.getNumberOfPages();
for (int i = 1; i <= pageCount; i++) { // page numbers are 1-based
PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(pdf);
String pageText = pdfTextExtractor.getTextFromPage(i);
// ... analyze pageText ...
}
Thanks in advance, regards.
I think that your PDF has an inline image. I do not think that iText 2.1.7 can deal with that.
You can find information regarding the license here
