iText - Modifying PDFWriter# Vertical Alignment manually - java

I've hit a problem using iText (Java), despite hours of looking through the docs.
Most of my code goes through the Document API, which tracks the current Y position via the PdfWriter instance. However, we need to use the PdfContentByte part of the API to insert some Java2D graphics into the document, and doing so appears to bypass the logic that tracks vertical writes. So the next time I use the Document API, it overwrites the manually inserted content. I want to mimic the behaviour of the Document API by manually moving the cursor on by N units (N being the height of the element inserted via the PdfContentByte API), so that when I use the Document object again, its cursor is in the correct location. I can see that a method to obtain the cursor exists:
PdfWriter#getVerticalPosition(boolean);
But not one to set it?!

The vertical position returned by PdfWriter is handled automatically by the writer class when you add paragraphs, tables, etc. to the document.
If you want to add custom graphics, you have to handle the vertical position manually by saving the position of the last graphic you drew.
If you have to draw graphics at an absolute position, without regard to the text added via Paragraph objects, this is simple.
But if you want to synchronize the position of the graphics with high-level objects (Paragraph, PdfPTable and so on), you must handle iText events.
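A minimal sketch of the manual bookkeeping described above, in plain Java with no iText calls. The class ManualCursor and its method names are made up for illustration; the start value is assumed to be whatever PdfWriter#getVerticalPosition(boolean) returned before the first graphic was drawn, and y is assumed to shrink as you move down the page (PDF's default origin is the lower left corner).

```java
/** Hypothetical helper: remembers where the last manually drawn graphic ended. */
class ManualCursor {
    private final float bottomMargin; // lowest y that content may reach
    private float y;                  // current vertical position, measured from the page bottom

    ManualCursor(float startY, float bottomMargin) {
        this.y = startY;
        this.bottomMargin = bottomMargin;
    }

    /** True if a block of the given height still fits above the bottom margin. */
    boolean fits(float height) {
        return y - height >= bottomMargin;
    }

    /** Reserves room for a block and returns the y of its lower edge. */
    float reserve(float height) {
        if (!fits(height)) {
            throw new IllegalStateException("block does not fit; start a new page first");
        }
        y -= height;
        return y;
    }

    /** Position at which the next block would start. */
    float position() {
        return y;
    }
}
```

The value returned by reserve(height) is where the lower edge of the graphic would go when drawing through PdfContentByte. This only covers the case where you keep drawing manually; it does not move the position that the Document API itself tracks.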

Related

Do I have to set the offset or cursor on every line in PDFBox?

I'm trying to learn PDFBox v2.0 but there seems to be few useful examples and tutorials to get started.
I want to create a simple PDF with text only (no images or fancy stuff) that looks like the following:
1- Introduction
This is an intro to the document
2- Data
2.1- DataPart1
This is some text here....
2.2- DataPart2
This is another text right here!
3- More Information
Some important informational text here...
I have written the following code:
PDPage firstPage = getNewPage();
PDPageContentStream firstContentStream = new PDPageContentStream(document, firstPage);
firstContentStream.setFont(HEADING_FONT, HEADING_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLUE);
firstContentStream.beginText();
firstContentStream.newLineAtOffset(MARGIN, firstPage.getMediaBox().getHeight() - MARGIN);
firstContentStream.showText("1- Introduction");
firstContentStream.endText();
firstContentStream.setFont(TEXT_FONT, TEXT_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLACK);
firstContentStream.beginText();
firstContentStream.showText("This document lists all the impacts that have been observed during the QA validation of the version v3.1 build");
firstContentStream.endText();
firstContentStream.close();
document.addPage(firstPage);
The text "This is the document intro" appears at the end of the page!
My question: do I have to set the cursor or the offset on every line???!
This seems like hard work. Does this mean I have to compute the string width and compare it to the page width and jump to new lines whenever the text fills the entire line?
I really don't understand.
And also how do I make it generate new pages whenever the current page is full?
Could you please give one simple example? Thanks in advance
Can't I just do something like:
contentStream.addTextOnNewLine("Some text here...");
First of all please be aware that PDFBox has a very low-level text drawing API only, i.e. most methods you call directly correspond to a single instruction in the PDF content stream. In particular PDFBox does not come with a (publicly accessible) layout'ing functionality but you essentially have to do the layout'ing in your code. This is both a freedom (you are not bound by the limits of an existing machinery) and a burden (you have to do everything yourself).
In your case that means:
Does this mean I have to compute the string width and compare it to the page width and jump to new lines whenever the text fills the entire line?
Yes. See the answers @Tilman linked for you
How to generate multiple lines in PDF using Apache pdfbox
Creating a new PDF document using PDFBOX API
and search some more on Stack Overflow; there are further answers on different variations of that theme, e.g. for justified text blocks here.
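To make the "compute the string width and compare it to the page width" step concrete, here is a minimal greedy word-wrap sketch in plain Java. The class LineWrapper and the widthOf parameter are names made up for this example. With PDFBox the measurer would typically be built on PDFont#getStringWidth, which reports widths in thousandths of a text unit, so something along the lines of font.getStringWidth(s) / 1000 * fontSize; treat that wiring as an assumption to check against your PDFBox version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: greedy word wrapping against a maximum line width. */
class LineWrapper {
    /**
     * Splits text into lines so that each line, as measured by widthOf,
     * is at most maxWidth wide. A single word wider than maxWidth is
     * kept on a line of its own and will overflow.
     */
    static List<String> wrap(String text, double maxWidth, ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            String candidate = current.length() == 0 ? word : current + " " + word;
            if (current.length() > 0 && widthOf.applyAsDouble(candidate) > maxWidth) {
                lines.add(current.toString()); // line is full, start a new one
                current = new StringBuilder(word);
            } else {
                current = new StringBuilder(candidate);
            }
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

Each returned line would then be drawn with one showText call, with a newLineAtOffset call between lines. The usable width to pass as maxWidth would be the media box width minus the left and right margins.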
But there is a detail in your code which in your case makes things possibly harder than necessary:
My question: do I have to set the cursor or the offset on every line???!
While you do have to re-position for every new line, it suffices to re-position by offset from the start of the previous line if you still are in the same text object, i.e. if you have not called endText() and beginText() in-between. E.g.:
PDPage firstPage = getNewPage();
PDPageContentStream firstContentStream = new PDPageContentStream(document, firstPage);
firstContentStream.setFont(HEADING_FONT, HEADING_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLUE);
firstContentStream.beginText();
firstContentStream.newLineAtOffset(MARGIN, firstPage.getMediaBox().getHeight() - MARGIN);
firstContentStream.showText("1- Introduction");
// removed firstContentStream.endText();
firstContentStream.setFont(TEXT_FONT, TEXT_FONT_SIZE);
firstContentStream.setNonStrokingColor(Color.BLACK);
// removed firstContentStream.beginText();
firstContentStream.newLineAtOffset(0, -TEXT_FONT_SIZE); // reposition, tightly packed text lines
firstContentStream.showText("This document lists all the impacts that have been observed during the QA validation of the version v3.1 build");
firstContentStream.endText();
firstContentStream.close();
document.addPage(firstPage);
newLineAtOffset(0, -TEXT_FONT_SIZE) sets the text insertion position at the same x coordinate as the start of the previous line and TEXT_FONT_SIZE units lower.
(Using TEXT_FONT_SIZE here will result in pretty tightly set text lines; you may want to use a higher value, e.g. 1.4 times the font size.)
Considering your description you might want to use a positive x offset value for the line after a heading, though.
The text "This is the document intro" appears at the end of the page!
This happens because the PDF instruction generated by firstContentStream.beginText() resets the text insertion position to the origin (0, 0) which by default is the lower left corner.
And also how do I make it generate new pages whenever the current page is full?
Well, it does not happen automatically, you explicitly have to do that when you want to switch to a new page. And you do that more or less exactly like you created your first page.
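A rough sketch of the page-break decision, in plain Java without PDFBox calls. The class Paginator is made up for this example. It assumes, as in the code above, that the first baseline of a page sits at pageHeight - margin, that each following line is placed leading units lower, and that no baseline may fall below margin.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: distributes already wrapped lines over pages. */
class Paginator {
    static List<List<String>> paginate(List<String> lines, double pageHeight,
                                       double margin, double leading) {
        double usable = pageHeight - 2 * margin;
        if (usable < 0 || leading <= 0) {
            throw new IllegalArgumentException("no room for text on the page");
        }
        // Baselines at top, top - leading, ... as long as they stay above the bottom margin.
        int perPage = (int) Math.floor(usable / leading) + 1;
        List<List<String>> pages = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += perPage) {
            int end = Math.min(i + perPage, lines.size());
            pages.add(new ArrayList<>(lines.subList(i, end)));
        }
        return pages;
    }
}
```

For each inner list you would create a new page and a new content stream in the same way as for the first page, then draw its lines.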

How to parse untagged pdf file with iText

I want to parse this file (http://www.bbm.ca/_documents/top_30_tv_programs_english/2011/nat01032011.pdf) with iText. The problem is it is not tagged so I can't get the XML file. I decided to extract the text from it and I thought that for example the first line will be like :
1\specialCharWJC:PLAYOFFS CANADA\specialCharTSN+\specialCharM.W....\specialChar19:30\specialChar21:57\specialChar5133
The text I extracted for the first line is
1 WJC:PLAYOFFS CANADA TSN+ M.W.... 19:30 21:57 5133
I extracted the text using :
PdfReader reader = new PdfReader(filename);
String str = PdfTextExtractor.getTextFromPage(reader, 1);
How the PDf viewer know that CANADA is in the second column not in the third.
My current solution is to convert the pdf file to html5 using http://www.idrsolutions.com/online-pdf-to-html5-converter/ which can determine the text for each column.
Thanks for your response
I wrote the iText text extractor. There are two extraction strategies in iText: one is naive (more a proof of concept) and just dumps text as it hits it. The other (LocationTextExtractionStrategy) is much more refined in how it builds strings, using the location and font information as @Jongware suggests (it also takes all coordinate transformations into account). The latter is the default strategy if you just call getTextFromPage() like you are.
The reason that the row 20 text displays twice is b/c some PDF producers do that to emulate a bold glyph (they shift the characters a tad and re-render). So that is not a bug, really - but certainly could be an opportunity for improvement. There may be something that we could do if we detect chunks of identical content that land within a certain twips zone of each other. The reason we haven't done this already is that this can be REALLY tricky, b/c you might have one chunk that is the entire word, and another set of chunks - one for each letter. We have the ability to do sub-chunk analysis (and in fact this is exposed in the parser interface somewhere - can't recall off hand - let me know if you need it and I'll track it down) - but that would come with a pretty hefty performance penalty, so I'm loath to do it.
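For the simple case only (identical chunks landing almost on top of each other), the detection could look roughly like the plain-Java sketch below. This is not iText code: the Chunk class, the method name and the tolerance parameter are made up for illustration, and the sketch deliberately ignores the hard case mentioned above, where one copy is a whole word and the other is split into single letters.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: drops chunks that repeat an earlier chunk at (nearly) the same place. */
class DuplicateChunkFilter {
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps the first occurrence of each chunk; O(n^2), fine for a page of text. */
    static List<Chunk> dropOverprints(List<Chunk> chunks, double tolerance) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk candidate : chunks) {
            boolean duplicate = false;
            for (Chunk earlier : kept) {
                if (earlier.text.equals(candidate.text)
                        && Math.abs(earlier.x - candidate.x) <= tolerance
                        && Math.abs(earlier.y - candidate.y) <= tolerance) {
                    duplicate = true; // same text, shifted by at most the tolerance
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```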
Anyway, the way that I would solve this specific challenge would be to set up physical zones and pass a region filter into the LocationTextExtractionStrategy#getResultantText() call.
If you truly need to insert tab characters (or some column marker) based on the horizontal position of the text, this is quite doable - take a look at where the isChunkAtWordBoundary() method is called in the LocationTextExtractionStrategy source code and add your own handler for inserting special characters beyond a space. It would also be possible to do some sort of contextual analysis (i.e. notice that there are a bunch of chunks that happen to share the same X position and orientation, and designate that X position as a tab stop).
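The contextual analysis idea (treating X positions shared by many chunks as tab stops) could be sketched as follows in plain Java. Nothing here is iText API: TabStopFinder, the tolerance bucket width and the minRows threshold are assumptions made for the example, and simple bucketing like this can split a column whose positions straddle a bucket boundary.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper: finds start positions that recur across many rows. */
class TabStopFinder {
    /**
     * rows holds, per text row, the x positions at which chunks start.
     * Positions are grouped into buckets of width tolerance; a bucket that
     * occurs in at least minRows rows is reported as a tab stop.
     */
    static List<Double> inferTabStops(List<double[]> rows, double tolerance, int minRows) {
        Map<Long, Integer> rowsPerBucket = new TreeMap<>();
        for (double[] startXs : rows) {
            Set<Long> seenInRow = new HashSet<>();
            for (double x : startXs) {
                long bucket = Math.round(x / tolerance);
                if (seenInRow.add(bucket)) { // count each bucket once per row
                    rowsPerBucket.merge(bucket, 1, Integer::sum);
                }
            }
        }
        List<Double> stops = new ArrayList<>();
        for (Map.Entry<Long, Integer> entry : rowsPerBucket.entrySet()) {
            if (entry.getValue() >= minRows) {
                stops.add(entry.getKey() * tolerance); // bucket centre, ascending order
            }
        }
        return stops;
    }
}
```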
If you come up with an idea that is nice and generic (i.e. not specific to this one parsing task), let me know and I'll see what I can do to incorporate it into iText.
This ...
How the PDf viewer know that CANADA is in the second column not in the third.
is the wrong sort of question -- but the "why" contains hints for a possible solution.
The question is "wrong" because your "PDF viewer" does not know text should be in the second column. There "is" no column in a PDF: all that the viewer gets is a list of (x,y) positions and the text to display at each of them. All it has to do is move a cursor to each (x,y) position and draw the text. See? No columns involved. Not a single [Tab] character either (or any other kind of magic \specialChar, for that matter).
A dumb, straightforward to-text converter scans the input file for text runs and writes them out immediately. It may test for x-positions that are larger than expected, and insert a space when necessary -- in fact, it seems iText does this because inspecting your file shows there is no 'space' character stored between "1" and "WJC:PLAYOFFS CANADA". There is a move to a larger x position on the same y position, so iText infers there is 'something'.
A possible solution is to store all (x,y) coordinates of all text fragments, sort them, and then test whether the end of each text fragment is within a reasonable distance of the start of the next one. (This requires you to retrieve the character widths as well.) If the distance is more or less equal to a space width, you can output a 'space'. If it's more, you can output a [Tab]. The following is the output of a simple PDF reader that does exactly this:
1 WJC:PLAYOFFS CANADA TSN+ M.W.... 19:30 21:57 5133
2 WJC:PLYOFF CAN PSTGM TSN+ ..W.... 21:54 22:21 3558
3 BIG BANG THEORY CTV Total ...T... 20:00 20:31 3334
-- I aligned the columns manually for clarity, as there was only a single [Tab] between each column. Your document is 'easy', in that every column contains some text. It's ever so slightly harder if it does not (but if necessary, you could create a list of likely tab positions, and test each new text string against that).
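A minimal sketch of the gap test described above, for the fragments of a single text row. It is plain Java, not the reader that produced the output shown: the Fragment class and the thresholds of 0.5 and 1.5 space widths are assumptions chosen for illustration. Grouping fragments into rows by their y position is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: joins the fragments of one row, inferring spaces and tabs from gaps. */
class GapJoiner {
    static final class Fragment {
        final double x;     // start position of the fragment
        final double width; // rendered width, from the character widths
        final String text;

        Fragment(double x, double width, String text) {
            this.x = x;
            this.width = width;
            this.text = text;
        }
    }

    static String joinRow(List<Fragment> fragments, double spaceWidth) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment fragment : sorted) {
            if (previous != null) {
                double gap = fragment.x - (previous.x + previous.width);
                if (gap > 1.5 * spaceWidth) {
                    out.append('\t'); // clearly wider than a space: column change
                } else if (gap > 0.5 * spaceWidth) {
                    out.append(' ');  // roughly one space wide
                }                     // smaller gaps: fragments belong together
            }
            out.append(fragment.text);
            previous = fragment;
        }
        return out.toString();
    }
}
```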
In short, you cannot use the plain function getTextFromPage, you need to retrieve correct x and y positions and process them.
Surprising: for some unknown reason the line
20 LAW AND ORDER:SVU CTV Total W 21:00 23:00 1295
is included twice in this document on exactly the same position. I did not anticipate that, and so after sorting, I got this in my output:
20<FONT ArialMT>20 LALAWW ANANDD ORDEORDER:SR:SVUVU CTCTVV TTotalotal ..WW.... 21:0021:00 23:0023:00 1295<FONT Arial-BoldMT>1295
A simpler solution
... would be to manually create a list of "Broadcast Outlets". The list has a fairly predictable format: [digits] [Title] [Outlet] .. (etc.), and only Title and Outlet do not follow a specific pattern. In this list I count just 4 different broadcasters. Parsing the remaining 'columns' should be straightforward.

How to read a PDF using iText and Java and get a table cell's height

First I created a PDF using iText and Java, and put a table with a table cell into it:
PdfPTable table = new PdfPTable(2);
table.setWidths(new int[]{1, 2});
PdfPCell cell;
table.addCell("Address:");
cell = new PdfPCell(new Phrase(""));
cell.setFixedHeight(60);
table.addCell(cell);
I have another program which reads this PDF file:
PdfReader reader = new PdfReader("path_of_previously_created_pdf");
Now I want to get the table cell and change its height with cell.setFixedHeight(new_Fixed_Height);
Is this possible? If yes, how?
Thanks in advance
If your PDF contains just that simple 1x2 table, it would of course be possible to implement something that gives you the PDF with a cell height of your choice.
But I assume it eventually is meant to contain more. Already the code you provided via your google drive included more (more table cells plus form elements), and that code, too, does look unfinished concerning the PDF construction. Thus,...
The direct answer
It is not possible.
First of all the table and cell objects you have while creating the PDF are not present as such in the resulting file, they merely are drawn as a number of lines and some text (or whatever you put into the cells).
Thus, you cannot even retrieve the cells you want to change, let alone change it.
The twisted answer
You could, of course, try and parse the page content stream to find the commands for drawing lines, find those ones among them which were drawn for the cell you are interested in, and try to derive the original cell dimension attributes from the line coordinates. Afterwards you can attempt to move everything below the cell down to create the extra space you want.
Depending on the information you have (Do you know the approximate position of the cell? If not, do you at least know some unique content of it?) reading the current cell height will include some guesswork and much coding because unfortunately the iText parser framework does not yet support parsing path operations.
Essentially you have to enhance the classes in the PDF parser package to also process and emit events for PDF path operators (if you know your way around in iText and the PDF specification that should not take more than a week or two) and create an appropriate event listener to find the lines surrounding the cell position you already know (not more than one day of work). Some iText code analysis will show how the fixed cell height and the distance of the surrounding lines relate.
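Once the horizontal line coordinates have been collected by such a listener, the last step is small. The plain-Java sketch below assumes you already have the y coordinates of the horizontal lines and a y value known to lie inside the cell; CellBounds and heightAround are made-up names, and how the measured distance relates to the value passed to setFixedHeight still has to be checked against the iText code.

```java
/** Hypothetical helper: measures the gap between the lines enclosing a known position. */
class CellBounds {
    /**
     * Returns the distance between the nearest horizontal line at or below y
     * and the nearest one above y, or NaN if y is not enclosed on both sides.
     */
    static double heightAround(double y, double[] horizontalLineYs) {
        double below = Double.NEGATIVE_INFINITY;
        double above = Double.POSITIVE_INFINITY;
        for (double lineY : horizontalLineYs) {
            if (lineY <= y && lineY > below) {
                below = lineY; // closest line underneath so far
            }
            if (lineY > y && lineY < above) {
                above = lineY; // closest line overhead so far
            }
        }
        if (Double.isInfinite(below) || Double.isInfinite(above)) {
            return Double.NaN;
        }
        return above - below;
    }
}
```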
Most likely, though, this is the smaller part of your work. The bigger part is actually manipulating the page content:
If you are lucky, all your page content is located in a single content stream. In that case you merely have to analyse all the page content again but this time to actually change it. The easiest way would be to enhance the classes in the parser package once again (because they already do much of the necessary math and book-keeping) to signal every command from the content stream with normalized coordinates (this might take a week or two). Based on the information signaled to you, you build an all-new content stream in which you leave everything above your cell, move down everything below, and stretch everything crossing the line on which the bottom border of your cell lies (another week maybe).
If you are less lucky you have to fight with multiple included form xobjects crossing the line. As those xobjects may be used from other streams also, you cannot change them but have to either change a copy or include the xobject content in your newly created stream.
Then what about images crossing the line? or interesting patterns? In that case stretching the cell will utterly distort everything.
And then there are annotations, e.g. your form fields. You need to shift and stretch them, too.
Thus, while this approach is possible to follow, please be aware that (depending on how generic the solution has to become) its implementation will take someone knowing iText and PDF some months.
An alternative approach
You say in a comment
I am working on Pdf Form.I have created itext form using TextField(MULTILINE TEXT) once. After read this pdf and fill up the form but when the content increases it shows scroll Bar and content hide. My problem is Once i print the pdf it did't print hide content.
Why don't you simply build, for each set of data, an individual PDF with all the cells big enough for the form contents of the respective data set, and copy the field values into this new PDF? This is a fairly simple approach, yet flexible enough to not waste too much space but at the same time not hide content.

Taking advantage of Swing double-buffering

I am writing a program that involves overriding a JPanel to do some graphics work. Basically this JPanel is displaying a field of hexagons and the user can interact with individual cells. This field is composed potentially of hundreds or thousands of hexagons, and most of the time when changes occur, only a tiny portion of the screen is affected. But at the same time, it uses multiple layers (there's a "main" layer that shows what the cells contain, and an overlay that shows which cells are selected and ready to be modified)
In order to avoid the cost of rendering each and every cell any time a repaint is needed, my app as it currently exists holds a BufferedImage for each layer. When paintComponent() is called, the app checks to see if any changes have been made that need to be drawn (for example, a user clicked on the screen so the overlay needs to show that some cell is selected), and if so, renders the modified layers to their BufferedImages. In either case, the repaint just blits both of the BufferedImages to the screen.
It works just fine, but this is basically a home-spun version of double buffering, which Swing is supposed to support internally. However, for the life of me I can't seem to find where to get access to the component's backing buffer so that I can modify it... If I call "getGraphics()" on the JPanel, I get null pointer exceptions -- So that's a no-go.
I am looking for a more elegant way to do this, one which hopefully takes advantage of the built-in double buffering offered by Swing, in order to avoid blitting the entire screen just for one tiny change (and to avoid having to reinvent clipping).
Any ideas?

Painting boxes in iText PDF

I know that iText PDF supports painting paragraphs which is quite good. However, I was wondering whether (besides using tables) is it possible to do custom drawing of the background of paragraphs, like adding a rounded border box etc.?
I know that there's the VerticalPositionMark interface that allows one to paint at the current y position; however, for a paragraph I'd need to know its exact position to be able to paint the background first (including any margin of the paragraph etc.).
Is there any way besides using tables for every single box?
thanks!
Alex
