I want to add a paragraph, containing HTML, to a document.
As far as I know, iText only supports adding HTML to a document directly via XMLWorkerHelper.
Furthermore, I want to change the font of the HTML, but that can be done with a CSS file.
My approach is similar to this code:
XMLWorkerHelper worker = XMLWorkerHelper.getInstance();
worker.parseXHtml(pdfWriter, document, fis);
But this solution is writing to the document directly. I want to add the HTML to a paragraph, so I may add some additional formatting to that section.
String html = "<p>Html code here</p>";
Paragraph comb = new Paragraph();
StringBuilder sb = new StringBuilder();
sb.append(html);
ElementList list = XMLWorkerHelper.parseToElementList(sb.toString(), null);
for (Element element : list) {
    comb.add(element);
}
para = new Paragraph(comb);
cell = new PdfPCell(para);
cell.setHorizontalAlignment(Element.ALIGN_LEFT);
cell.setBorder(Rectangle.NO_BORDER);
cell.setPaddingTop(0);
cell.setPaddingBottom(15f);
cell.setLeading(3f, 1.2f);
table.addCell(cell);
See the "parsing HTML step by step" example in the XML Worker documentation. In that example, the final pipeline is a PdfWriterPipeline, which isn't what you want (because this pipeline writes content to the document). You want to replace this final pipeline with an ElementHandlerPipeline, which converts all the HTML tags it encounters into an ElementList.
Once you have this list of Element instances, it's up to you to decide what to do with it (adding them to a Paragraph is one option).
Related
I'm trying to read .docx files with styling information using Apache POI, which I have done by looping through each XWPFParagraph and working with all the XWPFRun instances inside the paragraphs. Now I want to get the contents of each page. Is there a way to get the contents of each page, or is it possible to know which page a paragraph is on?
This is a function that takes the absolute path of a docx file and returns an array of strings:
FileInputStream fis = new FileInputStream(absolutePath);
XWPFDocument document = new XWPFDocument(fis);
List<IBodyElement> bodyElements = document.getBodyElements();
List<String> textList = new ArrayList<>();
/* I want to add some kind of outer loop here for each page
and at the end of that loop I want to add a "<hr/>" tag in the textList
*/
for (IBodyElement bodyElement : bodyElements) { // Looping through paragraphs
    if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
        XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
        String textToAdd = parseParagraph(paragraph); // custom function to handle paragraphs
        textList.add(textToAdd);
    }
}
document.close();
return textList.toArray(new String[0]);
As you can see my goal here is to add a <hr/> tag after each page. So, if somehow I can get the page number of a paragraph or loop through pages, I will be able to do that.
Please kindly mention if you know about any other approach that may help.
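To get the page count from an XWPFDocument (for your outer loop), you can do something like the following. Note that this is the page count stored in the file's extended properties by the application that last saved it, so it may be missing or out of date: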
XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(YOUR_FILE_PATH));
int numOfPages = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
For the text of each paragraph:
for (XWPFParagraph p : document.getParagraphs()) {
    System.out.println(p.getParagraphText()); // YOUR PARAGRAPH TEXT
}
I have a pdf template which contains images and form fields.
I read this template and fill form fields per page and write to a temp pdf file. Then I read this file and copy to a master document to have multiple pages using the same template. Roughly as below:
Document masterDoc = ...
// -- loop per page --
PdfWriter pdfWriter = new PdfWriter(tmpFileName);
PdfDocument pdf = new PdfDocument(new PdfReader(templateFile), pdfWriter);
Document doc = new Document(pdf);
// Set form fields
PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
form.setDefaultJustification(0);
Map<String, PdfFormField> formFields = form.getFormFields();
formFields.get("key").setValue("value");
form.flattenFields();
doc.close();
// Re-read the temp file and copy its first page into the master document
try (PdfDocument resource = new PdfDocument(new PdfReader(tmpFileName))) {
    resource.copyPagesTo(1, 1, masterDoc.getPdfDocument());
}
// -- end of loop --
This approach is slow (it depends on the template file size, but it takes seconds rather than milliseconds).
Would it be possible to reuse the same template for every page without writing to and reading from temp files?
I read the documentation and guess it might be possible with a new-page event handler, but I couldn't figure it out.
Can we create a new custom PDF operator (like PDFOperator{BDC}) and COSBase objects (like COSName{P} COSName{Prop1}, where Prop1 references one more object), and add these to the root structure of a PDF?
I have read a list of parser tokens from an existing PDF document and want to tag the PDF. To do that, I will first modify the list of tokens with newly created COSBase objects, and finally add them to the root structure tree. So how can I create COSBase objects? The code I am using to extract tokens from the PDF is:
old_document = PDDocument.load(new File(inputPdfFile));
// `document` is the target PDDocument (declared elsewhere, not shown here)
for (PDPage page : old_document.getPages())
{
    // Parse the page's content stream into tokens (operands and operators)
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    List<Object> tokens = parser.getTokens();
    // Collect tokens per page so one page's content is not written into the next
    List<Object> newTokens = new ArrayList<>();
    for (Object token : tokens) {
        System.out.println(token);
        if (token instanceof Operator) {
            Operator op = (Operator) token; // operators can be inspected here
        }
        newTokens.add(token);
    }
    // Write the tokens to a new content stream and attach it to the page
    PDStream newContents = new PDStream(document);
    OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(out);
    writer.writeTokens(newTokens);
    out.close();
    page.setContents(newContents);
    document.addPage(page);
}
document.save(outputPdfFile);
document.close();
old_document.close();
The above code creates a new PDF with all formatting and images.
The newTokens list contains all the existing COSBase objects, so I want to add some tagging COSBase objects to it. When I save the new document, it should then be tagged without my having to take care of any decoding, encoding, font, or image handling.
First, will this idea work? If yes, please help me with some code to create custom COSBase objects. I am very new to Java.
Depending on your document's structure, you can insert marked content.
// The code below adds "/P <</MCID 0>> BDC"
newTokens.add(COSName.getPDFName("P"));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
mcid++;
newTokens.add(currentMarkedContentDictionary);
newTokens.add(Operator.getOperator("BDC"));
// After the BDC, append your existing tokens (TJ, TD, Td, T*, ...)
newTokens.add(existing_token);
// Close the marked-content sequence with EMC
newTokens.add(Operator.getOperator("EMC"));
// Add the marked content to the structure tree
structureElement = new PDStructureElement(StandardStructureTypes.P, currentSection);
structureElement.setPage(page);
PDMarkedContent markedContent = new PDMarkedContent(COSName.P, currentMarkedContentDictionary);
structureElement.appendKid(markedContent);
currentSection.appendKid(structureElement);
Thanks to @Tilman Hausherr
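The token order described above can be illustrated without PDFBox. The following is only a sketch that models content-stream tokens as plain strings; the class and method names (`MarkedContentSketch`, `wrap`) are hypothetical and are not part of PDFBox. In real code the list would hold `COSName`, `COSDictionary`, and `Operator` objects, as in the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, library-free model of a marked-content sequence.
public class MarkedContentSketch {

    // Returns: /<tag> <</MCID n>> BDC <existing tokens...> EMC
    public static List<String> wrap(String tag, int mcid, List<String> existingTokens) {
        List<String> out = new ArrayList<>();
        out.add("/" + tag);                 // structure type name, e.g. /P
        out.add("<</MCID " + mcid + ">>");  // property dictionary holding the MCID
        out.add("BDC");                     // begin marked content
        out.addAll(existingTokens);         // the untouched original tokens
        out.add("EMC");                     // end marked content
        return out;
    }

    public static void main(String[] args) {
        List<String> text = Arrays.asList("BT", "(Hello)", "Tj", "ET");
        // prints: /P <</MCID 0>> BDC BT (Hello) Tj ET EMC
        System.out.println(String.join(" ", wrap("P", 0, text)));
    }
}
```

Each wrapped run needs its own MCID value, which is why the answer increments `mcid` after every dictionary it creates.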
I am trying to save all of the readable words on a web page into one text document while ignoring html markup.
I am using JSoup to parse all of the words on a webpage, and my only guess at how to separate the real words from the code is through elements.
Is it possible to convert multiple elements of the jsoup document into a text file?
i.e.:
Elements titles = doc.select("title");
Elements paragraphs = doc.select("p");
Elements links = doc.select("a[href]");
Elements smallText = doc.select("a");
Currently saving the parse as a document with:
Document doc = Jsoup.connect("https:// (enter a url)").get();
Here is a simple way:
Document doc = Jsoup.connect("https:// (enter a url)").get();
BufferedWriter writer = null;
try
{
writer = new BufferedWriter( new FileWriter("d://test.txt"));
writer.write(doc.toString());
}
catch ( IOException e)
{
}
Adding answer because I am unable to comment above.
Replace writer.write(doc.toString()); by writer.write(doc.select("html").text()); in the above code.
It will give you the text on the page.
Instead of "html" in doc.select("html").text(), other tags can be used to extract the text enclosed in those tags.
Edit: you can also use writer.write(doc.body().text());
After writing the text with writer.write(doc.text()); you need to call writer.close(); on the very next line. A BufferedWriter holds output in memory until it is flushed or closed, so this will fix the problem.
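One way to make sure the writer is always closed is a try-with-resources block. The sketch below uses only the standard library and takes the already extracted text as a plain string, so the jsoup call is left out; the class and method names (`SaveTextSketch`, `saveText`) are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveTextSketch {

    // Writes the text to the target file; the writer is closed (and therefore
    // flushed) automatically when the try block exits, even on an exception.
    public static void saveText(Path target, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("page-text", ".txt");
        // In the thread above, this string would come from doc.body().text()
        saveText(target, "text extracted from the page");
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```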
I have two PDF files (named A1.pdf and B1.pdf). I want to replace some pages of the second PDF file (B1.pdf) with pages from the first one (A1.pdf) programmatically. For this I am using the PDFBox library.
Here is my sample code:
try {
File file = new File("/Users/test/Desktop/A1.pdf");
PDDocument pdDoc = PDDocument.load(file);
PDDocument document = PDDocument.load(new File("/Users/test/Desktop/B1.pdf"));
document.removePage(3);
document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0));
document.save("/Users/test/Desktop/"+"generatedPDFBox"+".pdf");
document.close();
}catch(Exception e){}
The idea is to replace the 3rd page. In this implementation the page is appended after the last page of the output PDF. Can anyone help me implement this? If it is not possible with PDFBox, could you please suggest some other Java libraries?
This solution creates a third PDF file with the contents you asked for. Note that page indexes are zero-based, so the "3" in your question must be a "2".
PDDocument a1doc = PDDocument.load(file1);
PDDocument b1doc = PDDocument.load(file2);
PDDocument resDoc = new PDDocument();
List<PDPage> a1Pages = a1doc.getDocumentCatalog().getAllPages();
List<PDPage> b1Pages = b1doc.getDocumentCatalog().getAllPages();
// replace the 3rd page of the 2nd file with the 1st page of the 1st one
for (int p = 0; p < b1Pages.size(); ++p)
{
    if (p == 2)
        resDoc.addPage(a1Pages.get(0));
    else
        resDoc.addPage(b1Pages.get(p));
}
resDoc.save(file3);
a1doc.close();
b1doc.close();
resDoc.close();
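The zero-based indexing in the loop above can be checked in isolation. This is a minimal sketch that uses plain strings in place of PDPage objects; the class and method names (`ReplacePageSketch`, `replaceAt`) are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplacePageSketch {

    // Builds a new list in which the element at zeroBasedIndex is swapped for
    // the replacement; all other elements keep their original order.
    public static <T> List<T> replaceAt(List<T> pages, int zeroBasedIndex, T replacement) {
        if (zeroBasedIndex < 0 || zeroBasedIndex >= pages.size()) {
            throw new IndexOutOfBoundsException("No page at index " + zeroBasedIndex);
        }
        List<T> result = new ArrayList<>(pages.size());
        for (int p = 0; p < pages.size(); ++p) {
            result.add(p == zeroBasedIndex ? replacement : pages.get(p));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> b1Pages = Arrays.asList("B1", "B2", "B3", "B4");
        // The 3rd page has index 2; prints: [B1, B2, A1, B4]
        System.out.println(replaceAt(b1Pages, 2, "A1"));
    }
}
```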
If you want to work from the command line instead, look here:
https://pdfbox.apache.org/commandline/
Then use PDFSplit and PDFMerger.
I am not too familiar with how PDFBox works, but to answer your follow up I know you can accomplish what you want to do in a fairly simple manner with the Datalogics APDFL SDK. A free trial exists in case you want to look into it. Here is a code snippet so you can see how it would be done:
Document Doc1 = new Document("/Users/test/Desktop/A1.pdf");
Document Doc2 = new Document("/Users/test/Desktop/B1.pdf");
/* Delete the pages in the range 3-3 */
Doc2.deletePages(3, 3);
/* Document.LastPage is the position in Doc2 where the page is inserted; Doc1 is the document the page comes from; 0 is the number of the first page in Doc1 to insert; 1 is the number of pages to insert (starting from the page given in the previous parameter); PageInsertFlags lets you customize what does and does not get copied. */
Doc2.insertPages(Document.LastPage, Doc1, 0, 1, PageInsertFlags.All);
Doc2.save(EnumSet.of(SaveFlags.FULL), "out.pdf");
Alternatively, there is another method called replacePages which makes the deletion unnecessary. It all depends on what your end goal is, of course.