How to read a PDF using iText and Java and get a table cell height

First I created a PDF using iText and Java and put a table with cells in it:
PdfPTable table = new PdfPTable(2);
table.setWidths(new int[]{1, 2});
PdfPCell cell;
table.addCell("Address:");
cell = new PdfPCell(new Phrase(""));
cell.setFixedHeight(60);
table.addCell(cell);
I have another program which reads this PDF file:
PdfReader reader = new PdfReader("path_of_previously_created_pdf");
Now I want to get the table cell and change its height with cell.setFixedHeight(new_Fixed_Height);
Is that possible? If yes, how?
Thanks in advance.

If your PDF contains just that simple 1x2 table, it of course would be possible to implement something that gives you the PDF with a cell height of your choice.
But I assume it eventually is meant to contain more. Already the code you provided via your google drive included more (more table cells plus form elements), and that code, too, does look unfinished concerning the PDF construction. Thus,...
The direct answer
It is not possible.
First of all, the table and cell objects you have while creating the PDF are not present as such in the resulting file; they merely are drawn as a number of lines and some text (or whatever you put into the cells).
Thus, you cannot even retrieve the cells you want to change, let alone change them.
The twisted answer
You could, of course, try and parse the page content stream to find the commands for drawing lines, find those ones among them which were drawn for the cell you are interested in, and try to derive the original cell dimension attributes from the line coordinates. Afterwards you can attempt to move everything below the cell down to create the extra space you want.
Depending on the information you have (Do you know the approximate position of the cell? If not, do you at least know some unique content of it?) reading the current cell height will include some guesswork and much coding because unfortunately the iText parser framework does not yet support parsing path operations.
Essentially you have to enhance the classes in the PDF parser package to also process and emit events for PDF path operators (if you know your way around in iText and the PDF specification that should not take more than a week or two) and create an appropriate event listener to find the lines surrounding the cell position you already know (not more than one day of work). Some iText code analysis will show how the fixed cell height and the distance of the surrounding lines relate.
Most likely, though, this is the smaller part of your work. The bigger part is actually manipulating the page content:
If you are lucky, all your page content is located in a single content stream. In that case you merely have to analyse all the page content again, but this time to actually change it. The easiest way would be to enhance the classes in the parser package once again (because they already do much of the necessary math and book-keeping) to signal every command from the content stream with normalized coordinates (this might take a week or two). Based on the information signaled to you, you build an all-new content stream in which you leave everything above your cell as is, move down everything below, and stretch everything crossing the line on which the bottom border of your cell lies (another week maybe).
If you are less lucky you have to fight with multiple included form xobjects crossing the line. As those xobjects may be used from other streams also, you cannot change them but have to either change a copy or include the xobject content in your newly created stream.
Then what about images crossing the line? or interesting patterns? In that case stretching the cell will utterly distort everything.
And then there are annotations, e.g. your form fields. You need to shift and stretch them, too.
Thus, while this approach is possible to follow, please be aware that (depending on how generic the solution has to become) its implementation will take someone knowing iText and PDF some months.
An alternative approach
You say in a comment
I am working on Pdf Form.I have created itext form using TextField(MULTILINE TEXT) once. After read this pdf and fill up the form but when the content increases it shows scroll Bar and content hide. My problem is Once i print the pdf it did't print hide content.
Why don't you simply build, for each set of data, an individual PDF with all the cells big enough for the form contents of the respective data set, and copy the field values into this new PDF? This is a fairly simple approach, yet flexible enough to not waste too much space but at the same time not hide content.

Related

Do I have to set the offset or cursor on everyline in PDFBox?

I'm trying to learn PDFBox v2.0 but there seems to be few useful examples and tutorials to get started.
I want to create a simple PDF with text only (no images and fancy stuff) and it looks as the following:
1- Introduction
This is an intro to the document
2- Data
2.1- DataPart1
This is some text here....
2.2- DataPart2
This is another text right here!
3- More Information
Some important informational text here...
I have written the following code:
PDPage firstPage = getNewPage();
PDPageContentStream firstContentStream = new PDPageContentStream(document, firstPage);
firstContentStream.setFont(HEADING_FONT, HEADING_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLUE);
firstContentStream.beginText();
firstContentStream.newLineAtOffset(MARGIN, firstPage.getMediaBox().getHeight() - MARGIN);
firstContentStream.showText("1- Introduction");
firstContentStream.endText();
firstContentStream.setFont(TEXT_FONT, TEXT_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLACK);
firstContentStream.beginText();
firstContentStream.showText("This document lists all the impacts that have been observed during the QA validation of the version v3.1 build");
firstContentStream.endText();
firstContentStream.close();
document.addPage(firstPage);
The text "This is the document intro" appears at the end of the page!
My question: do I have to set the cursor or the offset on every line???!
This seems like hard work. Does this mean I have to compute the string width and compare it to the page width and jump to new lines whenever the text fills the entire line?
I really don't understand.
And also how do I make it generate new pages whenever the current page is full?
Could you please give one simple example? Thanks in advance
Can't I just do something like:
contentStream.addTextOnNewLine("Some text here...");

First of all please be aware that PDFBox has a very low-level text drawing API only, i.e. most methods you call directly correspond to a single instruction in the PDF content stream. In particular PDFBox does not come with a (publicly accessible) layout'ing functionality but you essentially have to do the layout'ing in your code. This is both a freedom (you are not bound by the limits of an existing machinery) and a burden (you have to do everything yourself).
In your case that means:
Does this mean I have to compute the string width and compare it to the page width and jump to new lines whenever the text fills the entire line?
Yes. Confer the answers @Tilman linked for you
How to generate multiple lines in PDF using Apache pdfbox
Creating a new PDF document using PDFBOX API
and search some more on Stack Overflow; there are further answers on different variations of that theme, e.g. for justified text blocks.
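As a rough illustration of the "compute the string width and jump to new lines" part, here is a minimal greedy word-wrap sketch in plain Java. It does not touch PDFBox at all: the class name LineWrapper and the widthOf parameter are made up for this example. With PDFBox you would probably pass something based on font.getStringWidth(s) / 1000 * fontSize (which throws IOException, so it needs wrapping in a lambda), and then emit each returned line with showText() followed by newLineAtOffset().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of PDFBox.
class LineWrapper {
    /**
     * Greedily packs words into lines whose measured width does not exceed maxWidth.
     * widthOf must measure a string in the same units as maxWidth.
     * A single word wider than maxWidth is kept on a line of its own (it will overflow).
     */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty input
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && widthOf.applyAsDouble(candidate) > maxWidth) {
                // The word does not fit any more: close the line and start a new one.
                lines.add(current.toString());
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

The width function is injected so that the wrapping logic can be tried out with a fake measure (for example one unit per character) before wiring in real font metrics.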
But there is a detail in your code which in your case makes things possibly harder than necessary:
My question: do I have to set the cursor or the offset on every line???!
While you do have to re-position for every new line, it suffices to re-position by offset from the start of the previous line if you still are in the same text object, i.e. if you have not called endText() and beginText() in-between. E.g.:
PDPage firstPage = getNewPage();
PDPageContentStream firstContentStream = new PDPageContentStream(document, firstPage);
firstContentStream.setFont(HEADING_FONT, HEADING_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLUE);
firstContentStream.beginText();
firstContentStream.newLineAtOffset(MARGIN, firstPage.getMediaBox().getHeight() - MARGIN);
firstContentStream.showText("1- Introduction");
// removed firstContentStream.endText();
firstContentStream.setFont(TEXT_FONT, TEXT_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLACK);
// removed firstContentStream.beginText();
firstContentStream.newLineAtOffset(0, -TEXT_FONT_SIZE); // reposition, tightly packed text lines
firstContentStream.showText("This document lists all the impacts that have been observed during the QA validation of the version v3.1 build");
firstContentStream.endText();
firstContentStream.close();
document.addPage(firstPage);
newLineAtOffset(0, -TEXT_FONT_SIZE) sets the text insertion position at the same x coordinate as the start of the previous line and TEXT_FONT_SIZE units lower.
(Using TEXT_FONT_SIZE here will result in pretty tightly set text lines; you may want to use a higher value, e.g. 1.4 times the font size.)
Considering your description you might want to use a positive x offset value for the line after a heading, though.
The text "This is the document intro" appears at the end of the page!
This happens because the PDF instruction generated by firstContentStream.beginText() resets the text insertion position to the origin (0, 0) which by default is the lower left corner.
And also how do I make it generate new pages whenever the current page is full?
Well, it does not happen automatically, you explicitly have to do that when you want to switch to a new page. And you do that more or less exactly like you created your first page.
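One possible way to keep track of when a page is full is a small cursor object that remembers the current y position and tells you when the next line would drop below the bottom margin. The sketch below is plain Java with invented names (PageCursor, advance); it only does the bookkeeping. When advance() returns true, your own code would close the current content stream, create and add a new page, and open a new content stream, much like for the first page.

```java
// Hypothetical bookkeeping helper, not part of PDFBox.
class PageCursor {
    private final double pageHeight;
    private final double margin;
    private double y;          // y of the current line, measured from the page bottom
    private int pageCount = 1;

    PageCursor(double pageHeight, double margin) {
        this.pageHeight = pageHeight;
        this.margin = margin;
        this.y = pageHeight - margin; // first line sits at the top margin
    }

    /**
     * Moves down by one line (leading = distance between baselines).
     * Returns true if the line did not fit; in that case the cursor has been
     * reset to the top of a fresh page and the caller has to create that page.
     */
    boolean advance(double leading) {
        if (y - leading < margin) {
            y = pageHeight - margin;
            pageCount++;
            return true;
        }
        y -= leading;
        return false;
    }

    double y() {
        return y;
    }

    int pageCount() {
        return pageCount;
    }
}
```

Note that after beginText() on the new page you would position absolutely again (MARGIN, cursor.y()), because relative offsets only work within one text object.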

iTextPDF : Set page size of PDF according to the size of the image to be inserted

I am generating PDF report using iTextPDF for Selenium WebDriver scripts developed in TestNG.
The report would contain text block (String) and images. Images always contain a text block before it.
The issue I am facing is that while creating the document, the text block and image blocks are getting displayed in the wrong order of occurrence in the test case. I believe this is because the image to be inserted has size greater than the PDF page.
Consider a scenario where the order of occurrence in the test is as follows
Text Block1
Image1
Text Block2
Text Block3
Image2
'Text Block4'
But the PDF shows as
Text Block1
Image1
Text Block2
Text Block3
Text Block4
Image2
My code is not wrong. I have triple checked it.
No, I cannot post the code because it is huge (>500 lines) and is in my company system.
I want to know if we can create a PDF page and then change its size dynamically when I encounter that the image to be inserted is large.

Your code is not wrong. When an image doesn't fit and there's text that follows the image, adding the image is postponed. You can change this behavior by using the following line:
writer.setStrictImageSequence(true);
In this case writer is your PdfWriter instance.
This solves one problem: the sequence of the text and images will now be correct. However, due to the image size, you will end up with plenty of white space in your document because images that don't fit will trigger a new page.
You could try to solve this by changing the page size. This involves using the setPageSize() method as explained in my answer to this question: iText create document with unequal page sizes
If you want to match page sizes to image sizes, take a look at my answer to this question: Add multiple images into a single pdf file with iText using java
The Image class extends Rectangle and we can use an Image object as a parameter when we create a Document instance, or we can use an Image object when we change the page size:
document.setPageSize(img);
document.newPage();
Important: when you change the page size, the new size will only go into effect on the next page. You can't change the size of the current page (it has already been initialized and changing it after initialization might screw up the content that was already added).
Also: it isn't sufficient for you to change the page size to the size of the image because you're also adding text. You could use ColumnText in simulation mode to find out how much space you need for the text, and then use ColumnText once more to add the text for real after you've created a page with a size that accommodates for the text and the image.
See Can I tell iText how to clip text to fit in a cell and look for the getYLine() method.
I wonder if it wouldn't be easier for you to scale down the images so that they fit the page... Of course: if the size of the images can vary, you'd have the risk that large images would become illegible.
P.S. All the answers I refer to are also available in the free ebook The Best iText Questions on StackOverflow. I bundled hundreds of my answers in this book so that I could easily search for already answered questions when answering a new question.

Automating filling out PDFs with iText and Java

I'm looking to automate filling out a PDF contract by putting variable text in predefined locations (such as a date or dollar value). I've been trying to wrap my head around iText as a solution but am having trouble actually stamping text onto a PDF. I would really appreciate an example snippet that simply stamps a piece of text on a specific (x,y) coordinate of a PDF file.
If there is a better solution than iText to my problem I would also love to hear other possible solutions.

How to parse untagged pdf file with iText

I want to parse this file (http://www.bbm.ca/_documents/top_30_tv_programs_english/2011/nat01032011.pdf) with iText. The problem is it is not tagged so I can't get the XML file. I decided to extract the text from it and I thought that for example the first line will be like :
1\specialCharWJC:PLAYOFFS CANADA\specialCharTSN+\specialCharM.W....\specialChar19:30\specialChar21:57\specialChar5133
The text I extracted for the first line is
1 WJC:PLAYOFFS CANADA TSN+ M.W.... 19:30 21:57 5133
I extracted the text using :
PdfReader reader = new PdfReader(filename);
String str = PdfTextExtractor.getTextFromPage(reader, 1);
How the PDf viewer know that CANADA is in the second column not in the third.
My current solution is to convert the pdf file to html5 using http://www.idrsolutions.com/online-pdf-to-html5-converter/ who can determine the text for each column.
Thanks for your response

I wrote the iText text extractor. There are two extraction strategies in iText - one is naive (more proof of concept) that just dumps text as it hits it. The other (LocationTextExtractionStrategy) is much more refined in how it builds strings, using the location and font information as @Jongware suggests (it also takes all coordinate transformations into account). The latter is the default strategy if you just call getTextFromPage() like you are.
The reason that the row 20 text displays twice is b/c some PDF producers do that to emulate a bold glyph (they shift the characters a tad and re-render). So that is not a bug, really - but certainly could be an opportunity for improvement. There may be something that we could do if we detect chunks of identical content that land within a certain twips zone of each other. The reason we haven't done this already is that this can be REALLY tricky, b/c you might have one chunk that is the entire word, and another set of chunks - one for each letter. We have the ability to do sub-chunk analysis (and in fact this is exposed in the parser interface somewhere - can't recall off hand - let me know if you need it and I'll track it down) - but that would come with a pretty hefty performance penalty, so I'm loath to do it.
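To make the "identical content within a small zone" idea concrete, here is a rough sketch in plain Java. It is not iText code: the Chunk class, the dropOverprints method and the tolerance parameter are invented for illustration, and it only covers the easy case where both renderings arrive as identical whole chunks. The tricky case mentioned above (one chunk per word versus one chunk per letter) is not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of iText.
class DuplicateChunkFilter {
    /** A piece of text with the position of its start point. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Keeps the first occurrence of every chunk and drops later chunks that have
     * the same text and start within 'tolerance' units of a kept chunk in both
     * directions. Quadratic in the number of chunks, which is fine for a sketch.
     */
    static List<Chunk> dropOverprints(List<Chunk> chunks, double tolerance) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk candidate : chunks) {
            boolean duplicate = false;
            for (Chunk k : kept) {
                if (k.text.equals(candidate.text)
                        && Math.abs(k.x - candidate.x) <= tolerance
                        && Math.abs(k.y - candidate.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```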
Anyway, the way that I would solve this specific challenge would be to set up physical zones and pass a region filter into the LocationTextExtractionStrategy#getResultantText() call.
If you truly need to insert tab characters (or some column marker) based on the horizontal position of the text, this is quite doable - take a look at where the isChunkAtWordBoundary() method is called in the LocationTextExtractionStrategy source code and add your own handler for inserting special characters beyond a space. It would also be possible to do some sort of contextual analysis (i.e. notice that there are a bunch of chunks that happen to share the same X position and orientation, and designate that X position as a tab stop).
If you come up with an idea that is nice and generic (i.e. not specific to this one parsing task), let me know and I'll see what I can do to incorporate it into iText.

This ...
How the PDf viewer know that CANADA is in the second column not in the third.
is the wrong sort of question -- but the "why" contains hints for a possible solution.
The question is "wrong" because your "PDF viewer" does not know text should be in the second column. There is no column in a PDF: all that the viewer gets is a list of (x,y) positions and the text to display at each of them. All it has to do is move a cursor to that (x,y) position and draw the text. See? No columns involved. Not a single [Tab] character either (or any other kind of magic \specialChar, for that matter).
A dumb, straightforward to-text converter scans the input file for text runs and writes them out immediately. It may test for x-positions that are larger than expected, and insert a space when necessary -- in fact, it seems iText does this because inspecting your file shows there is no 'space' character stored between "1" and "WJC:PLAYOFFS CANADA". There is a move to a larger x position on the same y position, so iText infers there is 'something'.
A possible solution is to store all (x,y) coordinates of all text fragments, sort them, and then test whether the end of each text fragment is within a reasonable distance of the start of the next one. (This requires you to retrieve the character widths as well.) If the distance is more or less equal to a space width, you can output a 'space'. If it's more, you can output a [Tab]. The following is the output of a simple PDF reader that does exactly this:
1  WJC:PLAYOFFS CANADA   TSN+       M.W....  19:30  21:57  5133
2  WJC:PLYOFF CAN PSTGM  TSN+       ..W....  21:54  22:21  3558
3  BIG BANG THEORY       CTV Total  ...T...  20:00  20:31  3334
-- I aligned the columns manually for clarity, as there was only a single [Tab] between each column. Your document is 'easy', in that every column contains some text. It's ever so slightly harder if it does not (but if necessary, you could create a list of likely tab positions, and test each new text string against that).
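The space-or-tab decision described above might look roughly like the following plain Java sketch. It assumes the fragments of one text line have already been collected with their start and end x coordinates (which is the part you need iText or another parser for). The Chunk class, the method name and the two thresholds (half a space width and two space widths) are my own assumptions, not values taken from any library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; collecting the chunks from the PDF is not shown.
class ColumnJoiner {
    /** A text fragment with the x coordinates where it starts and ends. */
    static final class Chunk {
        final double startX;
        final double endX;
        final String text;

        Chunk(double startX, double endX, String text) {
            this.startX = startX;
            this.endX = endX;
            this.text = text;
        }
    }

    /**
     * Joins the chunks of ONE text line from left to right.
     * A gap of roughly one space width becomes ' ', a clearly larger gap
     * becomes '\t', and a tiny gap joins the chunks directly.
     */
    static String joinLine(List<Chunk> chunks, double spaceWidth) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.startX));
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            if (previous != null) {
                double gap = c.startX - previous.endX;
                if (gap > 2 * spaceWidth) {        // assumed threshold for a column break
                    out.append('\t');
                } else if (gap > 0.5 * spaceWidth) { // assumed threshold for a word break
                    out.append(' ');
                }
            }
            out.append(c.text);
            previous = c;
        }
        return out.toString();
    }
}
```

Grouping chunks into lines by their y coordinate would have to happen before this step.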
In short, you cannot use the plain function getTextFromPage, you need to retrieve correct x and y positions and process them.
Surprising: for some unknown reason the line
20 LAW AND ORDER:SVU CTV Total W 21:00 23:00 1295
is included twice in this document on exactly the same position. I did not anticipate that, and so after sorting, I got this in my output:
20<FONT ArialMT>20 LALAWW ANANDD ORDEORDER:SR:SVUVU CTCTVV TTotalotal ..WW.... 21:0021:00 23:0023:00 1295<FONT Arial-BoldMT>1295
A simpler solution
... would be to manually create a list of "Broadcast Outlets". The list has a fairly predictable format: [digits] [Title] [Outlet] .. (etc.), and only Title and Outlet do not follow a specific pattern. In this list I count just 4 different broadcasters. Parsing the remaining 'columns' should be straightforward.
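A minimal sketch of that idea with a regular expression, working on the plain text lines returned by getTextFromPage(). The outlet list below contains only the two outlets visible in the sample lines above (TSN+ and CTV Total); the others would have to be added by hand. The class name and the meaning I give to the seven fields (rank, title, outlet, day pattern, start, end, final number) are guesses from the sample, not something the document states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration for the sample lines shown above.
class RatingLineParser {
    // [digits] [title] [outlet] [day pattern] [hh:mm] [hh:mm] [digits]
    // The outlet alternation is incomplete on purpose: extend it with the real list.
    private static final Pattern LINE = Pattern.compile(
            "^(\\d+) (.+?) (TSN\\+|CTV Total) (\\S+) (\\d{2}:\\d{2}) (\\d{2}:\\d{2}) (\\d+)$");

    /** Returns the seven fields of a line, or empty if the line does not match. */
    static Optional<List<String>> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            fields.add(m.group(i));
        }
        return Optional.of(fields);
    }
}
```

Because the title is matched lazily and the outlet must be one of the known names, a title containing spaces (such as "BIG BANG THEORY") still ends up in a single field.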

iText - Modifying PDFWriter# Vertical Alignment manually

I've hit upon a problem using iText (java), despite hours of looking thru the docs..
Most of the code I use goes via the Document# API which tracks (via the PdfWriter instance) the current Y position. HOWEVER, we need to use the PdfContentByte part of the API to insert some Java2D into the document, but in doing so this appears to bypass the logic which tracks vertical writes. So next time I use the Document API, it overwrites the contents of the manually inserted things. I want to mimic the behaviour of the Document# API by manually moving the cursor on by N units (N being the height of the element inserted by the PdfContentByte API), such that when I then use the Document object again, bingo, its cursor is in the correct location. I can see that a method to obtain the cursor exists;
PdfWriter#getVerticalPosition(boolean);
But not one to set it?!

The vertical position returned by PdfWriter is handled automatically by the writer class when you add paragraphs, tables, etc. to the document.
If you want to add custom graphics, you have to handle the vertical position manually by saving the position of the last graphic you drew.
If you have to draw graphics at an absolute position, without regard to the text added via Paragraph objects, this is simple.
But if you want to synchronize the position of the graphics with high-level objects (Paragraph, PdfPTable and so on), you must handle iText events.
