How to create bookmarks (a table of contents) from headings with iText in Java?

I create a PDF from a list of HTML strings, and the PDF is generated correctly. I also want to add a table of contents built from the h1, h2, etc. headings. How can I do that? Below is my function for creating the PDF. I looked at the existing examples on the iText site, but I couldn't make them work the way I want.
public static void createMultiplePagedPdf(String destinationFile, ArrayList<String> htmlStrings,
String cssLocation, HeaderFooter headerFooter, boolean tableOfContents) {
String css = null;
ElementList list = null;
Document document = new Document();
try {
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(destinationFile));
if(headerFooter!=null)
writer.setPageEvent(headerFooter);
TableOfContent tocEvent = new TableOfContent();
writer.setPageEvent(tocEvent);
document.open();
if(tableOfContents){
tocEvent.setRoot(writer.getRootOutline());
for (TOCEntry entry : tocEvent.getToc()) {
Chunk c = new Chunk(entry.title);
c.setAction(entry.action);
document.add(new Paragraph(c));
}
}
if (cssLocation != null)
css = readCSS(cssLocation);
for (String htmlfile : htmlStrings) {
if (css != null)
list = XMLWorkerHelper.parseToElementList(htmlfile, css);
else
list = XMLWorkerHelper.parseToElementList(htmlfile, null);
for (Element e : list) {
document.add(e);
}
}
System.out.println("Pdf Created successfully");
document.close();
} catch ...
}
tocEvent.getToc() returns an empty list. Moving that if statement to the end of the code makes no difference.
My TOCEntry and TableOfContent classes are the same as those in the first example of "Creating Table of Contents using events" for iText 5.
Thanks in advance!

Related

Pull data with Jsoup and write it into an Excel file

I am attempting to use Jsoup in Eclipse to pull data from an HTML table and write that data into an Excel file. I am able to pull the table data out of the HTML, but it all comes back as one string that is difficult to write to an Excel file. I am not sure this is the correct way to pull the information, but this is my current code:
public static void main(String args[]) {
Document document;
try {
document = Jsoup.connect("https://companiesmarketcap.com/usa/largest-companies-in-the-usa-by-market-cap/?page=1/").get();
Elements trs = document.select("td");
for (Element NEW : trs ) {
Elements table = NEW.getElementsByTag("td");
Element td = table.first();
String TA = td.text();
}
}
catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
This gets me my table data, but I can't reshape it to put it into an Excel document. Any help is much appreciated!
You have to be more specific when selecting the individual elements so that you can format them the way you want. Keep studying the Jsoup documentation and CSS selectors. I kept it simple: the code below prints tab-separated text that you can copy and paste into Excel.
public static void main(String[] args) {
try {
StringBuilder sb = new StringBuilder();
Document document = Jsoup.connect("https://companiesmarketcap.com/usa/largest-companies-in-the-usa-by-market-cap/?page=1/").get();
Elements thList = document.select("table > thead > tr > th"); // Get table column headers
for (Element th : thList) {
sb.append(th.text()).append("\t"); // Append table header's text content + tab separator
}
sb.setLength(sb.length() - 1); // To delete the last tab
sb.append("\n"); // Append new line to end of header's content
Elements trList = document.select("table > tbody > tr"); // Get all rows in the table body
for (Element tr : trList) {
Elements tdList = tr.select("td"); // Get column details
for (Element td : tdList) {
sb.append(td.text()).append("\t"); // Append table details' text content + tab separator
}
sb.setLength(sb.length() - 1); // To delete the last tab
sb.append("\n"); // Append new line to end of row's content
}
System.out.println(sb);
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
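As a side note, the trailing-tab trimming above can be avoided by joining each row's cells with String.join. The sketch below uses only the standard library and is not Jsoup code: TsvBuilder and its methods are hypothetical names, and the sample rows are made up. In practice each row would be filled from th.text() and td.text() in the loops above. It also replaces tabs and line breaks inside a cell with a space, on the assumption that every cell should stay on one line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Jsoup): turns rows of cell text into
// tab-separated lines that a spreadsheet can paste or import.
final class TsvBuilder {

    // Replace tabs and line breaks inside a cell so they cannot be
    // mistaken for column or row separators.
    static String cleanCell(String cell) {
        return cell.replaceAll("[\\t\\r\\n]+", " ").trim();
    }

    // Join one row. String.join puts separators only between cells,
    // so there is no trailing tab to delete, even for an empty row.
    static String toLine(List<String> cells) {
        List<String> cleaned = new ArrayList<>();
        for (String cell : cells) {
            cleaned.add(cleanCell(cell));
        }
        return String.join("\t", cleaned);
    }

    // One line per row, each terminated by a newline.
    static String toTsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(toLine(row)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the scraped table.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Rank", "Name", "Market Cap"),
                Arrays.asList("1", "Example Corp", "$1.0 T"));
        System.out.print(toTsv(rows));
    }
}
```

Joining this way also sidesteps an edge case in the setLength approach: if a selector matches nothing and the builder is still empty, sb.setLength(sb.length() - 1) is called with -1 and throws.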

PDFBox Open PDF file into new browser tab

I am using version 2.0 of the PDFBox library. I need to open a PDF in a new browser tab with the print dialog showing, i.e. in print view.
We are migrating from iText to PDFBox; the existing iText code is below.
In iText, the PdfAction class achieves this:
PdfAction action = new PdfAction(PdfAction.PRINTDIALOG);
and the print JavaScript is applied to the document with:
copy.addJavaScript(action);
I need an equivalent solution with PDFBox.
Document document = new Document();
try{
outputStream=response.getOutputStream();
// step 2
PdfCopy copy = new PdfCopy(document, outputStream);
// step 3
document.open();
// step 4
PdfReader reader;
int n;
//add print dialog in Pdf Action to open file for preview.
PdfAction action = new PdfAction(PdfAction.PRINTDIALOG);
// loop over the documents you want to concatenate
Iterator i=mergepdfFileList.iterator();
while(i.hasNext()){
File f =new File((String)i.next());
is=new FileInputStream(f);
reader=new PdfReader(is);
n = reader.getNumberOfPages();
for (int page = 0; page < n; ) {
copy.addPage(copy.getImportedPage(reader, ++page));
}
copy.freeReader(reader);
reader.close();
is.close();
}
copy.addJavaScript(action);
// step 5
document.close();
}catch(IOException io){
throw io;
}catch(DocumentException e){
throw e;
}catch(Exception e){
throw e;
}finally{
outputStream.close();
}
I also tried the approach from a linked reference, but could not find a print() method on the PDDocument type.
Please guide me with this.
This is how the file looks when displayed in a browser tab:
This code reproduces what your file has: a JavaScript action in the name tree of the JavaScript entry in the name dictionary of the document catalog. ("When the document is opened, all of the actions in this name tree shall be executed, defining JavaScript functions for use by other scripts in the document" - PDF specification.) There is probably an easier way to do this, e.g. with an OpenAction.
PDActionJavaScript javascript = new PDActionJavaScript("this.print(true);\n");
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDDocumentNameDictionary names = new PDDocumentNameDictionary(documentCatalog, new COSDictionary());
PDJavascriptNameTreeNode javascriptNameTreeNode = new PDJavascriptNameTreeNode();
Map<String, PDActionJavaScript> map = new HashMap<>();
map.put("0000000000000000", javascript);
javascriptNameTreeNode.setNames(map);
names.setJavascript(javascriptNameTreeNode);
document.getDocumentCatalog().setNames(names);

Create new custom COSBase objects with PDFBox?

Can we create a new custom PDF operator (like PDFOperator{BDC}) and COSBase objects (like COSName{P} COSName{Prop1}, where Prop1 in turn references another object), and add these to the root structure of a PDF?
I have read a list of parser tokens from an existing PDF document, and I want to tag the PDF. To do that, I will first add newly created COSBase objects to the list of tokens, and finally add them to the root structure tree. So how can I create COSBase objects? The code I am using to extract tokens from the PDF is:
old_document = PDDocument.load(new File(inputPdfFile));
List<Object> newTokens = new ArrayList<>();
for (PDPage page : old_document.getPages())
{
PDFStreamParser parser = new PDFStreamParser(page);
parser.parse();
List<Object> tokens = parser.getTokens();
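for (Object token : tokens) {
System.out.println(token);
if (token instanceof Operator) {
Operator op = (Operator) token; // placeholder: operators would be inspected or modified here
}
newTokens.add(token); // must be inside the loop, where token is in scope
}
}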
PDStream newContents = new PDStream(document);
document.addPage(page);
OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(out);
writer.writeTokens(newTokens);
out.close();
page.setContents(newContents);
document.save(outputPdfFile);
document.close();
The code above creates a new PDF with all formatting and images preserved.
The newTokens list contains all the existing COSBase objects, so I want to add some tagging COSBase objects to it. If I then save the new document, it should be tagged without my having to handle decoding, encoding, fonts, or images.
First, will this idea work? If so, please help me with some code to create custom COSBase objects. I am very new to Java.
Depending on your document's format, you can insert marked content.
// The code below adds "/P <</MCID 0>> BDC"
newTokens.add(COSName.getPDFName("P"));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
mcid++;
newTokens.add(currentMarkedContentDictionary);
newTokens.add(Operator.getOperator("BDC"));
// After adding the MCID, append your existing tokens (TJ, TD, Td, T*, ...)
newTokens.add(existing_token);
// Close the marked content with EMC
newTokens.add(Operator.getOperator("EMC"));
// Add the marked content to the structure tree
structureElement = new PDStructureElement(StandardStructureTypes.P, currentSection);
structureElement.setPage(page);
PDMarkedContent markedContent = new PDMarkedContent(COSName.P, currentMarkedContentDictionary);
structureElement.appendKid(markedContent);
currentSection.appendKid(structureElement);
Thanks to @Tilman Hausherr
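To make the token order easier to see, here is a minimal sketch that models content-stream tokens as plain strings rather than PDFBox objects. MarkedContentSketch and wrapTextObjects are hypothetical names. Unlike the snippet above, which wraps individual text operators, this version assumes it is acceptable to wrap each whole BT ... ET text object in its own marked-content sequence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: tokens are plain strings here, not PDFBox
// COSBase/Operator objects. It only illustrates the token order.
final class MarkedContentSketch {

    // Wraps every BT ... ET text object in "/P <</MCID n>> BDC ... EMC",
    // numbering the MCIDs from 0 in order of appearance.
    static List<String> wrapTextObjects(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int mcid = 0;
        for (String token : tokens) {
            if (token.equals("BT")) {
                out.add("/P");                      // tag name operand
                out.add("<</MCID " + mcid + ">>");  // property dictionary operand
                out.add("BDC");                     // begin marked content
                mcid++;
            }
            out.add(token);                         // existing token, unchanged
            if (token.equals("ET")) {
                out.add("EMC");                     // end marked content
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        System.out.println(String.join(" ", wrapTextObjects(tokens)));
    }
}
```

With real PDFBox tokens the comparison would be against Operator names instead of strings, and each MCID would still need a matching structure element as shown above.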

Read a docx document using Java

I have a steganography project that hides a docx document in a JPEG image. Using Apache POI, I can read the docx document, but only the text is read, even though the document also contains pictures.
Here is the code:
FileInputStream in = null;
try
{
in = new FileInputStream(directory);
XWPFDocument datax = new XWPFDocument(in);
XWPFWordExtractor extract = new XWPFWordExtractor(datax);
String DataFinal = extract.getText();
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String line = null;
this.isi_file = extract.getText();
}
catch (IOException x) {}
System.out.println("isi :" + this.isi_file);
How can I read all the components of a docx document using Java? Thank you for your help.
Please check the documentation for the XWPFDocument class. It contains some useful methods, for example:
getAllPictures() returns a list of all pictures in the document;
getTables() returns a list of all tables in the document.
Your code snippet contains the line XWPFDocument datax = new XWPFDocument(in);. After that line you can write code like:
// process all pictures in document
for (XWPFPictureData picture : datax.getAllPictures()) {
// get each picture as byte array
byte[] pictureData = picture.getData();
// process picture somehow
...
}
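If you only need the raw picture bytes, another option that avoids POI is to treat the .docx as the ZIP package it is. This is a different technique from getAllPictures(): it assumes the pictures are stored under word/media/, which is the usual convention but is not guaranteed for every file. DocxMediaReader, readMedia and sampleArchive are hypothetical names, and the sample archive is a stand-in built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class DocxMediaReader {

    // Assumption: embedded pictures live under this folder in the package.
    private static final String MEDIA_PREFIX = "word/media/";

    // Returns entry name -> raw bytes for every file under word/media/.
    static Map<String, byte[]> readMedia(InputStream docx) {
        Map<String, byte[]> media = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().startsWith(MEDIA_PREFIX)) {
                    continue; // skip document.xml, styles, relationships, ...
                }
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    buffer.write(chunk, 0, read);
                }
                media.put(entry.getName(), buffer.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return media;
    }

    // Builds a tiny in-memory ZIP that mimics the .docx layout (demo only).
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/media/image1.png"));
            zip.write(new byte[] {1, 2, 3}); // placeholder bytes, not a real PNG
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> media = readMedia(new ByteArrayInputStream(sampleArchive()));
        for (Map.Entry<String, byte[]> e : media.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

For a real file, readMedia(new FileInputStream(directory)) would take the place of the in-memory sample. The POI methods remain the better choice when you also need to know where in the document each picture is used.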

How to add a paragraph or text between Tables in .docx file with apache POI

I'm trying to add paragraphs or text to a .docx file using Apache POI. I read a .docx file used as a template from the WEB-INF/resources/templates folder inside my WAR file. Once it is read, I want to dynamically create more tables after the 9th table in the template. I'm able to add more tables, but other kinds of content (paragraphs) end up in a different part of the document. Is there a required way to do this?
XWPFDocument doc = null;
try {
doc = new XWPFDocument(OPCPackage.open(request.getSession().getServletContext().getResourceAsStream("/resources/templates/twd.docx")));
} catch (Exception e) {
e.printStackTrace();
}
XWPFParagraph parrafo = null;
XWPFTable table=null;
org.apache.xmlbeans.XmlCursor cursor = null;
XWPFParagraph newParagraph = null;
XWPFRun run = null;
for(int j=0; j < 3; j++) { //create 3 more tables
table = doc.getTables().get(9);
cursor = table.getCTTbl().newCursor();
cursor.toEndToken();
if (cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START);
{
table = doc.insertNewTbl(cursor);
table.getRow(0).getCell(0).addParagraph().createRun()
.setText("Name");
table.createRow().getCell(0).addParagraph().createRun().setText("Version");
table.createRow().getCell(0).addParagraph().createRun().setText("Description");
table.createRow().getCell(0).addParagraph().createRun().setText("Comments");
table.createRow().getCell(0).addParagraph().createRun().addCarriageReturn();
table.getRow(0).createCell().addParagraph().createRun().setText("some text");
table.getRow(1).createCell().addParagraph().createRun().setText("some text");
table.getRow(2).createCell().addParagraph().createRun().setText("some text");
table.getRow(3).createCell().addParagraph().createRun().setText("some text");
table.getRows().get(0).getCell(0).setColor("183154");
table.getRows().get(1).getCell(0).setColor("183154");
table.getRows().get(2).getCell(0).setColor("183154");
table.getRows().get(3).getCell(0).setColor("183154");
table.getCTTbl().addNewTblGrid().addNewGridCol().setW(BigInteger.valueOf(4000));
table.getCTTbl().getTblGrid().addNewGridCol().setW(BigInteger.valueOf(4000));
}
//OTHER CONTENT BETWEEN CREATED TABLES (PARAGRAPHS, BREAK LINES,ETC)
doc.createParagraph().createRun().setText("text after table");
}
Once you use a cursor, you must keep using that cursor to insert further content if you want that content to land in the part of the document where the cursor is. The document will not automatically take note of the cursor you created.
So, for example:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
public class WordTextAfterTable {
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordTextAfterTable.docx"));
XWPFTable table = document.getTables().get(9);
org.apache.xmlbeans.XmlCursor cursor = table.getCTTbl().newCursor();
cursor.toEndToken(); //now we are at end of the CTTbl
//there always must be a next start token after the table. Either a p or at least sectPr.
while(cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START); //we loop over the tokens until next TokenType.START
//now we are at next TokenType.START and insert the new table
//note: This is immediately after the table. So both tables touch each other.
table = document.insertNewTbl(cursor);
table.getRow(0).getCell(0).addParagraph().createRun().setText("Name");
table.createRow().getCell(0).addParagraph().createRun().setText("Version");
table.createRow().getCell(0).addParagraph().createRun().setText("Description");
table.createRow().getCell(0).addParagraph().createRun().setText("Comments");
table.createRow().getCell(0).addParagraph().createRun().addCarriageReturn();
//...
System.out.println(cursor.isEnd()); //cursor is now at the end of the new table
//there always must be a next start token after the table. Either a p or at least sectPr.
while(cursor.toNextToken() != org.apache.xmlbeans.XmlCursor.TokenType.START); //we loop over the tokens until next TokenType.START
XWPFParagraph newParagraph = document.insertNewParagraph(cursor);
XWPFRun run = newParagraph.createRun();
run.setText("text after table");
document.write(new FileOutputStream("WordTextAfterTableNew.docx"));
document.close();
}
}
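The same effect can be seen with a plain java.util.ListIterator, used here purely as an analogy for the XmlCursor. The class and method names are made up and no POI classes are involved. Inserting through one cursor keeps the new items together, while appending to the list, as doc.createParagraph() does in the question, puts the text at the end of the body.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

// Analogy only: a list of strings stands in for the document body.
final class CursorInsertSketch {

    // Inserts both new items right after `anchor`, using one cursor.
    static List<String> insertAtCursor(List<String> body, String anchor) {
        List<String> result = new ArrayList<>(body);
        ListIterator<String> cursor = result.listIterator();
        while (cursor.hasNext()) {
            if (cursor.next().equals(anchor)) {
                cursor.add("new table");        // lands directly after the anchor
                cursor.add("text after table"); // same cursor, so it follows the new table
                break;
            }
        }
        return result;
    }

    // Mirrors the question: the table goes in at a position,
    // but the paragraph is appended to the end of the body.
    static List<String> insertThenAppend(List<String> body, String anchor) {
        List<String> result = new ArrayList<>(body);
        int index = result.indexOf(anchor);
        if (index >= 0) {
            result.add(index + 1, "new table");
        }
        result.add("text after table");
        return result;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("table 9", "closing paragraph");
        System.out.println(insertAtCursor(body, "table 9"));
        System.out.println(insertThenAppend(body, "table 9"));
    }
}
```

In the first result the text sits between the new table and the closing paragraph; in the second it ends up after the closing paragraph, which matches the behaviour described in the question.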
