iText: Chapters placed into dual column format - java

I am writing a program in Java using iText 5. A big part of the program involves printing things out to PDF, and a big part of that is a human readable dictionary.
Given a chapter that contains phrases for all the words in my dictionary, is there any way to hand this to iText and have it laid out in a two-column-per-page format? Essentially, I'm looking for pages that read left half top to left half bottom -> right half top to right half bottom -> next page. Anyone familiar with old-school paper dictionaries will know what I'm talking about here, and that's the look that I'm going for.
Sorry that I don't have any example code to go off of here, but I'm kind of stymied, as I haven't been able to find anything that does what I want.
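The reading order described above (left column top to bottom, then right column, then next page) can be written down as simple index arithmetic. The following is only a sketch of that ordering in plain Java, not iText code: the class name TwoColumnFlow, the method locate and the fixed rowsPerColumn are assumptions made for illustration, and real entries would have varying heights.

```java
/** Illustrative only: maps the n-th dictionary entry to its slot in a two-column page layout. */
final class TwoColumnFlow {

    /**
     * Returns {page, column, row}, all zero-based, for the entry at entryIndex.
     * Assumes every entry takes exactly one row and every column holds rowsPerColumn rows.
     */
    static int[] locate(int entryIndex, int rowsPerColumn) {
        int entriesPerPage = 2 * rowsPerColumn;           // two columns per page
        int page = entryIndex / entriesPerPage;
        int withinPage = entryIndex % entriesPerPage;
        int column = withinPage / rowsPerColumn;          // 0 = left half, 1 = right half
        int row = withinPage % rowsPerColumn;             // counted from the top of the column
        return new int[] {page, column, row};
    }

    public static void main(String[] args) {
        int[] slot = locate(85, 40);
        System.out.println("page=" + slot[0] + " column=" + slot[1] + " row=" + slot[2]);
    }
}
```

In a real layout the column would be filled until the text no longer fits rather than after a fixed row count, but the order in which the slots are visited stays the same.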

Related

Retrieving text position (as px/in/cm coordinates) in a Word document

I'm using Apache POI to search Word files (doc and docx) to find specified paragraphs and tables. Using various Q/A's from SO and the API, so far this works fine.
What I'd like to do next is to convert the Word file into an image and highlight my search results by drawing boxes around my found paragraphs/tables.
I already wrote the parts, where I draw boxes around text on a PDF and convert those to images (using PDFBox), and I read that Tika will be able to print my Word doc to a PDF.
But I'm totally clueless as to how to retrieve the position of my text paragraphs/tables. I've searched the API, but the closest I was able to find was the character position in a paragraph (as in the "i-th character in the paragraph"), which tells me nothing about where I'm supposed to start/stop drawing my box.
My "Plan-B" would be to "empty print" all the paragraphs I'm not interested in and only "visibly print" my found part into a PDF and retrieve the coordinates there. But I'd really like to avoid that, since I'm afraid there will be other complications to retrieve the exact position if I change the text appearances.
I don't want to draw the box (or otherwise highlight) the text directly in the word doc, since I'm planning to port the presentation part to a web application (and draw the boxes with <div> or something).
Does anyone have an idea how I can proceed on this or know some place I might find a hint or solution?

Automating filling out PDFs with iText and Java

I'm looking to automate filling out a PDF contract by putting variable text in predefined locations (such as a date or dollar value). I've been trying to wrap my head around iText as a solution but am having trouble actually stamping text onto a PDF. I would really appreciate an example snippet that simply stamps a piece of text at a specific (x,y) coordinate of a PDF file.
If there is a better solution than iText to my problem I would also love to hear other possible solutions.

How to parse untagged pdf file with iText

I want to parse this file (http://www.bbm.ca/_documents/top_30_tv_programs_english/2011/nat01032011.pdf) with iText. The problem is it is not tagged so I can't get the XML file. I decided to extract the text from it and I thought that for example the first line will be like :
1\specialCharWJC:PLAYOFFS CANADA\specialCharTSN+\specialCharM.W....\specialChar19:30\specialChar21:57\specialChar5133
The text I extracted for the first line is
1 WJC:PLAYOFFS CANADA TSN+ M.W.... 19:30 21:57 5133
I extracted the text using :
PdfReader reader = new PdfReader(filename);
String str = PdfTextExtractor.getTextFromPage(reader, 1);
How does the PDF viewer know that CANADA is in the second column and not in the third?
My current solution is to convert the PDF file to HTML5 using http://www.idrsolutions.com/online-pdf-to-html5-converter/ which can determine the text for each column.
Thanks for your response
I wrote the iText text extractor. There are two extraction strategies in iText: one is naive (more a proof of concept) and just dumps text as it hits it. The other (LocationTextExtractionStrategy) is much more refined in how it builds strings, using the location and font information as @Jongware suggests (it also takes all coordinate transformations into account). The latter is the default strategy if you just call getTextFromPage() like you are.
The reason that the row 20 text displays twice is b/c some PDF producers do that to emulate a bold glyph (they shift the characters a tad and re-render). So that is not a bug, really - but certainly could be an opportunity for improvement. There may be something that we could do if we detect chunks of identical content that land within a certain twips zone of each other. The reason we haven't done this already is that this can be REALLY tricky, b/c you might have one chunk that is the entire word, and another set of chunks - one for each letter. We have the ability to do sub-chunk analysis (and in fact this is exposed in the parser interface somewhere - can't recall off hand - let me know if you need it and I'll track it down) - but that would come with a pretty hefty performance penalty, so I'm loath to do it.
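For the easy case only (whole chunks with identical text printed almost on top of each other), the detection could look roughly like the sketch below. It is plain Java rather than iText code: the Chunk class, the dropOverprints method and the tolerance value are assumptions for illustration, and it does nothing for the tricky word-versus-single-letters case mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: removes chunks that repeat an earlier chunk at (nearly) the same position. */
final class DuplicateChunkFilter {

    /** Hypothetical stand-in for a text chunk with its start position in page units. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps the first chunk of each overprinted group; a chunk is a duplicate if text matches and both offsets are within tolerance. */
    static List<Chunk> dropOverprints(List<Chunk> chunks, double tolerance) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk candidate : chunks) {
            boolean duplicate = false;
            for (Chunk earlier : kept) {
                if (earlier.text.equals(candidate.text)
                        && Math.abs(earlier.x - candidate.x) <= tolerance
                        && Math.abs(earlier.y - candidate.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("20", 10.0, 100.0));
        chunks.add(new Chunk("20", 10.3, 100.0)); // shifted re-render used to fake bold
        chunks.add(new Chunk("LAW", 30.0, 100.0));
        System.out.println(dropOverprints(chunks, 0.5).size() + " chunks kept");
    }
}
```

The pairwise comparison is quadratic in the number of chunks, which is acceptable for a page of text but is one reason this kind of check has a cost.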
Anyway, the way that I would solve this specific challenge would be to set up physical zones and pass a region filter into the LocationTextExtractionStrategy#getResultantText() call.
If you truly need to insert tab characters (or some column marker) based on the horizontal position of the text, this is quite doable - take a look at where the isChunkAtWordBoundary() method is called in the LocationTextExtractionStrategy source code and add your own handler for inserting special characters beyond a space. It would also be possible to do some sort of contextual analysis (i.e. notice that there are a bunch of chunks that happen to share the same X position and orientation, and designate that X position as a tab stop).
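The contextual analysis idea (many chunks starting at the same X position suggest a tab stop) might be sketched as follows. This is plain Java and not part of iText: the class TabStopFinder, the method findTabStops and its tolerance and minCount parameters are all assumptions, and the chunk start positions would have to come from your own render listener.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: guesses tab stops from the x positions at which chunks start. */
final class TabStopFinder {

    /**
     * Groups start positions into buckets of width tolerance and reports every bucket
     * that at least minCount chunks fall into, in ascending x order.
     * Simple bucketing can split a cluster that straddles a bucket boundary.
     */
    static List<Double> findTabStops(double[] chunkStartXs, double tolerance, int minCount) {
        TreeMap<Long, Integer> countsByBucket = new TreeMap<>();
        for (double x : chunkStartXs) {
            long bucket = Math.round(x / tolerance);
            countsByBucket.merge(bucket, 1, Integer::sum);
        }
        List<Double> tabStops = new ArrayList<>();
        for (Map.Entry<Long, Integer> entry : countsByBucket.entrySet()) {
            if (entry.getValue() >= minCount) {
                tabStops.add(entry.getKey() * tolerance); // representative x of the bucket
            }
        }
        return tabStops;
    }

    public static void main(String[] args) {
        double[] starts = {30.0, 30.1, 29.9, 150.0, 150.2, 149.9, 77.0};
        System.out.println(findTabStops(starts, 1.0, 3));
    }
}
```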
If you come up with an idea that is nice and generic (i.e. not specific to this one parsing task), let me know and I'll see what I can do to incorporate it into iText.
This ...
How does the PDF viewer know that CANADA is in the second column and not in the third?
is the wrong sort of question -- but the "why" contains hints for a possible solution.
The question is "wrong" because your "PDF viewer" does not know text should be in the second column. There "is" no column in a PDF: all that the viewer gets is a list of (x,y) positions and the text to display at each of them. All it has to do is move a cursor to that (x,y) position and draw the text. See? No columns involved. Not a single [Tab] character either (or any other kind of magic \specialChar, for that matter).
A dumb, straightforward to-text converter scans the input file for text runs and writes them out immediately. It may test for x-positions that are larger than expected, and insert a space when necessary -- in fact, it seems iText does this because inspecting your file shows there is no 'space' character stored between "1" and "WJC:PLAYOFFS CANADA". There is a move to a larger x position on the same y position, so iText infers there is 'something'.
A possible solution is to store all (x,y) coordinates of all text fragments, sort them, and then test whether the end of each text fragment is within a reasonable distance of the start of the next one. (This requires you to retrieve the character widths as well.) If the distance is more or less equal to a space width, you can output a 'space'. If it's more, you can output a [Tab]. The following is the output of a simple PDF reader that does exactly this:
1   WJC:PLAYOFFS CANADA     TSN+        M.W....   19:30   21:57   5133
2   WJC:PLYOFF CAN PSTGM    TSN+        ..W....   21:54   22:21   3558
3   BIG BANG THEORY         CTV Total   ...T...   20:00   20:31   3334
-- I aligned the columns manually for clarity, as there was only a single [Tab] between each column. Your document is 'easy', in that every column contains some text. It's ever so slightly harder if it does not (but if necessary, you could create a list of likely tab positions, and test each new text string against that).
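The sort-and-compare-gaps approach can be sketched in plain Java. This is an illustration only, not the reader that produced the output above: the Fragment class, the join method and the 0.3 and 1.5 thresholds are assumptions, and the sketch assumes that fragments on one line share exactly the same baseline y.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: rebuilds lines from positioned text fragments, inferring spaces and tabs from gaps. */
final class FragmentJoiner {

    /** Hypothetical stand-in for a text run with its start position and rendered width. */
    static final class Fragment {
        final double x;
        final double y;
        final double width;
        final String text;

        Fragment(double x, double y, double width, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.text = text;
        }
    }

    static String join(List<Fragment> fragments, double spaceWidth) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // PDF y grows upward, so larger y comes first; within a line go left to right.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : sorted) {
            if (previous != null) {
                if (Math.abs(current.y - previous.y) > 0.5) {
                    out.append('\n');                      // new baseline, new line
                } else {
                    double gap = current.x - (previous.x + previous.width);
                    if (gap > 1.5 * spaceWidth) {
                        out.append('\t');                  // clearly wider than a space
                    } else if (gap > 0.3 * spaceWidth) {
                        out.append(' ');                   // roughly one space
                    }                                      // otherwise the runs touch
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(93, 700, 30, "CANADA"));
        fragments.add(new Fragment(10, 700, 5, "1"));
        fragments.add(new Fragment(30, 700, 60, "WJC:PLAYOFFS"));
        System.out.println(join(fragments, 3.0));
    }
}
```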
In short, you cannot use the plain function getTextFromPage, you need to retrieve correct x and y positions and process them.
Surprising: for some unknown reason the line
20 LAW AND ORDER:SVU CTV Total W 21:00 23:00 1295
is included twice in this document on exactly the same position. I did not anticipate that, and so after sorting, I got this in my output:
20<FONT ArialMT>20 LALAWW ANANDD ORDEORDER:SR:SVUVU CTCTVV TTotalotal ..WW.... 21:0021:00 23:0023:00 1295<FONT Arial-BoldMT>1295
A simpler solution
... would be to manually create a list of "Broadcast Outlets". The list has a fairly predictable format: [digits] [Title] [Outlet] .. (etc.), and only Title and Outlet do not follow a specific pattern. In this list I count just 4 different broadcasters. Parsing the remaining 'columns' should be straightforward.
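A rough sketch of this list-based parsing in plain Java is shown below. The class RatingsLineParser and its parse method are made up for illustration, and the OUTLETS array contains only the two outlets visible in the sample lines; the remaining broadcasters would have to be added by hand.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: splits one extracted line into its columns using a known outlet list. */
final class RatingsLineParser {

    // Assumption: incomplete list, taken from the sample lines only.
    static final String[] OUTLETS = {"CTV Total", "TSN+"};

    /** Returns {rank, title, outlet, days, start, end, viewers}, or null if the line does not fit. */
    static String[] parse(String line) {
        StringBuilder outletAlternatives = new StringBuilder();
        for (String outlet : OUTLETS) {
            if (outletAlternatives.length() > 0) {
                outletAlternatives.append('|');
            }
            outletAlternatives.append(Pattern.quote(outlet)); // "+" must not act as a quantifier
        }
        Pattern pattern = Pattern.compile(
                "^(\\d+) (.+) (" + outletAlternatives + ") (\\S+) "
                + "(\\d{1,2}:\\d{2}) (\\d{1,2}:\\d{2}) (\\d+)$");
        Matcher matcher = pattern.matcher(line.trim());
        if (!matcher.matches()) {
            return null;
        }
        String[] fields = new String[7];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = matcher.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                parse("3 BIG BANG THEORY CTV Total ...T... 20:00 20:31 3334")));
    }
}
```

Because the title is the only free-form field left once the outlet is pinned down, the title may contain spaces without confusing the split.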

How to read a PDF using iText and Java and get the table cell height

First I created a PDF using iText and Java and put a table with a table cell in it:
PdfPTable table = new PdfPTable(2);
table.setWidths(new int[]{1, 2});
PdfPCell cell;
table.addCell("Address:");
cell = new PdfPCell(new Phrase(""));
cell.setFixedHeight(60);
table.addCell(cell);
I have another program which reads this PDF file:
PdfReader reader = new PdfReader("path_of_previously_created_pdf");
Now I want to get the table cell and change its height with cell.setFixedHeight(new_Fixed_Height).
Is this possible? If yes, how?
Thanks in advance.
If your PDF contains just that simple 1x2 table, it of course would be possible to implement something that gives you the PDF with a cell height of your choice.
But I assume it eventually is meant to contain more. Already the code you provided via your google drive included more (more table cells plus form elements), and that code, too, does look unfinished concerning the PDF construction. Thus,...
The direct answer
It is not possible.
First of all the table and cell objects you have while creating the PDF are not present as such in the resulting file, they merely are drawn as a number of lines and some text (or whatever you put into the cells).
Thus, you cannot even retrieve the cells you want to change, let alone change them.
The twisted answer
You could, of course, try and parse the page content stream to find the commands for drawing lines, find those ones among them which were drawn for the cell you are interested in, and try to derive the original cell dimension attributes from the line coordinates. Afterwards you can attempt to move everything below the cell down to create the extra space you want.
Depending on the information you have (Do you know the approximate position of the cell? If not, do you at least know some unique content of it?) reading the current cell height will include some guesswork and much coding because unfortunately the iText parser framework does not yet support parsing path operations.
Essentially you have to enhance the classes in the PDF parser package to also process and emit events for PDF path operators (if you know your way around in iText and the PDF specification that should not take more than a week or two) and create an appropriate event listener to find the lines surrounding the cell position you already know (not more than one day of work). Some iText code analysis will show how the fixed cell height and the distance of the surrounding lines relate.
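Once the horizontal line positions are available, the last step could be as small as the sketch below. It is plain Java with made-up names (CellHeightEstimator, estimate), and it rests on two assumptions: that the y coordinates of the horizontal lines have already been extracted by your own listener, and that the distance between the two border lines equals the fixed cell height, which would need to be checked against the iText source as noted above.

```java
/** Illustrative only: estimates a cell's height from the horizontal lines around a known point. */
final class CellHeightEstimator {

    /**
     * Finds the nearest horizontal line at or below yInsideCell and the nearest at or above it,
     * and returns their distance. Returns NaN if the point is not enclosed by two lines.
     */
    static double estimate(double[] horizontalLineYs, double yInsideCell) {
        double below = Double.NEGATIVE_INFINITY;
        double above = Double.POSITIVE_INFINITY;
        for (double y : horizontalLineYs) {
            if (y <= yInsideCell && y > below) {
                below = y;
            }
            if (y >= yInsideCell && y < above) {
                above = y;
            }
        }
        if (Double.isInfinite(below) || Double.isInfinite(above)) {
            return Double.NaN;
        }
        return above - below;
    }

    public static void main(String[] args) {
        double[] lineYs = {700.0, 640.0, 620.0};
        System.out.println(estimate(lineYs, 670.0));
    }
}
```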
Most likely, though, this is the smaller part of your work. The bigger part is actually manipulating the page content:
If you are lucky, all your page content is located in a single content stream. In that case you merely have to analyse all the page content again but this time to actually change it. The easiest way would be to enhance the classes in the parser package once again (because they already do much of the necessary math and book-keeping) to signal every command from the content stream with normalized coordinates (this might take a week or two). Based on the information signaled to you, you build an all-new content stream in which you leave everything above your cell, move down everything below, and stretch everything crossing the line on which the bottom border of your cell lies (another week maybe).
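The leave / move / stretch rule itself is simple arithmetic once every object has normalized coordinates. A minimal sketch in plain Java follows; the class VerticalShift and the method remap are hypothetical, objects are reduced to their top and bottom y values, and PDF coordinates are assumed (y grows upward, so moving down means subtracting).

```java
import java.util.Arrays;

/** Illustrative only: remaps an object's vertical extent when a cell's bottom border moves down. */
final class VerticalShift {

    /**
     * lineY is the current y of the cell's bottom border and delta the extra height (positive).
     * Returns the new {top, bottom} of an object that currently spans from top down to bottom.
     */
    static double[] remap(double top, double bottom, double lineY, double delta) {
        if (bottom >= lineY) {
            return new double[] {top, bottom};                  // entirely above: leave alone
        }
        if (top <= lineY) {
            return new double[] {top - delta, bottom - delta};  // entirely below: move down
        }
        return new double[] {top, bottom - delta};              // crosses the line: stretch
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(remap(660.0, 620.0, 640.0, 20.0)));
    }
}
```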
If you are less lucky you have to fight with multiple included form xobjects crossing the line. As those xobjects may be used from other streams also, you cannot change them but have to either change a copy or include the xobject content in your newly created stream.
Then what about images crossing the line? or interesting patterns? In that case stretching the cell will utterly distort everything.
And then there are annotations, e.g. your form fields. You need to shift and stretch them, too.
Thus, while this approach is possible to follow, please be aware that (depending on how generic the solution has to become) its implementation will take someone knowing iText and PDF some months.
An alternative approach
You say in a comment
I am working on Pdf Form.I have created itext form using TextField(MULTILINE TEXT) once. After read this pdf and fill up the form but when the content increases it shows scroll Bar and content hide. My problem is Once i print the pdf it did't print hide content.
Why don't you simply build an individual PDF for each set of data, with all the cells big enough for the form contents of the respective data set, and copy the field values into this new PDF? This is a fairly simple approach, yet flexible enough to not waste too much space while at the same time not hiding content.

iText: Realign Text position on a PDF page

I have a PDF page that has a logo, 4 address lines and content. I want to move the 4 address lines to the left by about 3 - 4 inches, but keep everything else the same. Is it possible to do this using the Java version of iText?
As far as I know, that's going to be a hard one. That's not a typical iText mission. iText is good for creating PDFs from scratch, not so good at reading or editing them. (You can easily add content though.)
