How do I get the page number that contains a particular word in a PDF, using the PDFBox API in Java?
I am able to read word with:
PDFTextStripper s = new PDFTextStripper();
String contents = s.getText(pdoc);
if(contents.contains("SUBSCRIPTION DETAILS")){
...
}
But I am not able to find the page number that contains this word.
Thanks in advance.
PDFTextStripper lets you restrict extraction to a specific page range. So you need to iterate through all pages and check whether each page contains the string:
PDDocument pdoc = ...;
PDFTextStripper s = new PDFTextStripper();
// start and end pages are 1-based and inclusive, hence <= in the loop condition
for (int pageNumber = 1; pageNumber <= pdoc.getNumberOfPages(); pageNumber++) {
    s.setStartPage(pageNumber);
    s.setEndPage(pageNumber);
    String contents = s.getText(pdoc);
    if (contents.contains("SUBSCRIPTION DETAILS")) {
        ... // pageNumber is a page that contains the phrase
    }
}
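If the phrase can occur on more than one page, it may help to keep the matching separate from the extraction. The following is only a minimal sketch: it assumes the text of each page has already been extracted into a list (for example with setStartPage/setEndPage as above), and the class and method names are hypothetical, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PageSearch {
    // Returns the 1-based numbers of all pages whose text contains the phrase.
    // Assumption: pageTexts.get(0) holds the text of page 1, and so on.
    static List<Integer> pagesContaining(List<String> pageTexts, String phrase) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (pageTexts.get(i).contains(phrase)) {
                hits.add(i + 1); // convert list index to 1-based page number
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("Cover", "SUBSCRIPTION DETAILS\nPlan A", "Terms");
        System.out.println(pagesContaining(pages, "SUBSCRIPTION DETAILS")); // prints [2]
    }
}
```

A plain contains() check will miss a phrase that the extracted text breaks across two lines, so normalizing whitespace before matching may be worth considering.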
Related
I'm trying to read .docx files with styling information using Apache POI, which I have done by looping through each XWPFParagraph and working with all the XWPFRun objects inside the paragraphs. Now I want to get the contents of each page. Is there a way to get the contents of each page, or to find out which page a paragraph is on?
This is a function that takes the absolute path of a docx file and returns an array of strings
FileInputStream fis = new FileInputStream(absolutePath);
XWPFDocument document = new XWPFDocument(fis);
List<IBodyElement> bodyElements = document.getBodyElements();
List<String> textList = new ArrayList<>();
/* I want to add some kind of outer loop here for each page
and at the end of that loop I want to add a "<hr/>" tag in the textList
*/
for (IBodyElement bodyElement : bodyElements) { // Looping through paragraphs
if (bodyElement.getElementType() == BodyElementType.PARAGRAPH) {
XWPFParagraph paragraph = (XWPFParagraph) bodyElement;
String textToAdd = parseParagraph(paragraph); // custom function to handle paragraphs
textList.add(textToAdd);
}
}
document.close();
return textList.toArray(new String[0]);
As you can see my goal here is to add a <hr/> tag after each page. So, if somehow I can get the page number of a paragraph or loop through pages, I will be able to do that.
Please kindly mention if you know about any other approach that may help.
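To get the page count from an XWPFDocument (for your outer loop), you can do something like this. Note that this reads the page count stored in the file's extended properties by the application that last saved it; POI does not lay out the document itself, so the value can be out of date: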
XWPFDocument docx = new XWPFDocument(POIXMLDocument.openPackage(YOUR_FILE_PATH));
int numOfPages = docx.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
For your paragraph text,
for (XWPFParagraph p : document.getParagraphs()) {
System.out.println(p.getParagraphText()); // YOUR PARAGRAPH TEXT
}
I use Apache PDFBox to parse text from pdf file. I tried to get a line after a specific line.
PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
System.out.println("Text from pdf:" + text);
} else{
log.info("File is encrypted!");
}
document.close();
Sample:
Sentence 1, nth line of file
Needed line
Sentence 3, n+2th line of file
I tried to get all the lines of the file into an array, but that is unreliable because I cannot filter for a specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution.
Solution 1:
String[] lines = myString.split(System.getProperty("line.separator"));
Solution 2:
String neededline = (String) FileUtils.readLines(file).get("n+2th")
In fact, the source code for the PDFTextStripper class uses the exact same line ending as you, so your first attempt is as close to correct as possible using PDFBox.
You see, the PDFTextStripper getText method calls the writeText method which just writes to an output buffer line by line with the writeString method in the exact same way as you have already tried. The result returned from this method is the buffer.toString().
Therefore, given a well formatted PDF, it would seem the question you are really asking is how to filter an array for specific text. Here are some ideas:
First, capture the lines in an array, as you said.
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Main {
static String[] lines;
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("my2.pdf"));
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
lines = text.split(System.getProperty("line.separator"));
document.close();
}
}
Here's a method to get a complete String by any line number index, easy:
// returns a full String line by number n
static String getLine(int n) {
return lines[n];
}
Here's a linear search method that finds a string match and returns the first line number where found.
// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
int n = 0;
for(String line : lines) {
if(line.indexOf(filter) != -1) {
return n;
}
n++;
}
return -1;
}
With the above, it is possible to get a line by its index:
System.out.println(getLine(8)); // line 8 for example
Or the entire String line that contains your matched search (check for -1 first if the text may be absent):
System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
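Since the question asks for the line that follows a known line, the search and the lookup can be combined. This is only a sketch under the same assumption that the extracted text splits cleanly into lines; the class and method names are made up for illustration, and it splits on \R (any line break) rather than on the platform line separator.

```java
final class NextLine {
    // Returns the line following the first line that contains marker,
    // or null if there is no match or the match is the last line.
    static String lineAfter(String text, String marker) {
        String[] lines = text.split("\\R"); // \R matches \n, \r\n and \r (Java 8+)
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].contains(marker)) {
                return lines[i + 1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Sentence 1, nth line of file\nNeeded line\nSentence 3, n+2th line of file";
        System.out.println(lineAfter(text, "Sentence 1")); // prints Needed line
    }
}
```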
This all seems pretty straightforward, but it works only under the assumption that the text can be split into lines by the line separator. If the solution is not as simple as the above ideas, I believe the source of your problem may not be your use of PDFBox but rather the PDF you are trying to text mine.
Here's a link to a tutorial that also does what you are trying to do:
https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/
Again, same approach...
I am using iText for extraction of data from PDFs. My application is able to read PDFs with English characters, but we found a new file with Chinese characters. When I tried to extract that data, I get an error:
ExceptionConverter: com.itextpdf.text.DocumentException: Font 'STSong-Light' with 'UniGB-UCS2-H' is not recognized.
So I added itext-asian.jar. Now I am not getting an error, but getTextFromPage() returns an empty string. Am I missing something?
PdfReader pr = new PdfReader(inputPdf);
// get the number of pages in the document
PdfTextExtractor pte =
new PdfTextExtractor(pr, new CustomLocationAwarePdfRenderListener(scanDepth));
int pNum = pr.getNumberOfPages();
String text = "";
// extract text from each page and write it to the output text file
for (int page = 1; page <= pNum; page++) {
text = text.concat("\n").concat(pte.getTextFromPage(page));
}
I use XHTMLConverter to convert .docx to HTML, to make a preview of the document. Is there any way to convert only a few pages of the original document? I'll be grateful for any help.
You have to parse the complete .docx file. It is not possible to read just parts of it. If you want to know how to select a specific page number, I'm afraid (at least as far as I know) that Word does not store page numbers, so there is no function in the library to access a specific page.
(I've read this on another forum; it might actually be false information.)
PS: the Excel part of POI has a getSheetAt() method (this might help your research).
But there are other ways to access your pages. For instance, you could read the lines of your document and search for the page numbers (this might break if your text contains those numbers, though). Another, more accurate way would be to search for the header of each page. Note that HeaderStories belongs to POI's HWPF API, which reads the older binary .doc format rather than .docx:
HeaderStories headerStore = new HeaderStories( doc);
String header = headerStore.getHeader(pageNumber);
This should give you the header of the specified page. The same works for the footer:
HeaderStories headerStore = new HeaderStories( doc);
String footer = headerStore.getFooter(pageNumber);
If this doesn't work, I can't say much more; I am not really familiar with that API.
Here is a little example of a very sloppy solution:
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile
{
    public static void main(String[] args)
    {
        File file = new File("c:\\New.doc");
        // try-with-resources closes the stream even if parsing fails
        try (FileInputStream fis = new FileInputStream(file))
        {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            int firstLineOfPageOne = -1; // index of the paragraph that starts page one
            int lastLineOfPageOne = -1;  // index of the paragraph that starts page two
            for (int i = 0; i < fileData.length; i++)
            {
                // trim() because extracted paragraphs may end with line-break characters
                if (fileData[i].trim().equals("headerPageOne")) {
                    firstLineOfPageOne = i;
                }
                if (fileData[i].trim().equals("headerPageTwo")) {
                    lastLineOfPageOne = i;
                }
            }
            System.out.println(firstLineOfPageOne + " to " + lastLineOfPageOne);
        }
        catch (Exception exep)
        {
            exep.printStackTrace();
        }
    }
}
If you go with this, I would recommend creating a list of your headers and refactoring the for-loop into a separate getPages() method. Your loop would then look like:
List<String> headers = new ArrayList<String>(Arrays.asList("header1","header2","header3","header4"));
int x = 0; // index of the page being searched for; there should be a loop over x too
int firstLineOfPage = -1;
int lastLineOfPage = -1;
for (int i = 0; i < fileData.length; i++)
{
    if (fileData[i].trim().equals(headers.get(x))) {
        firstLineOfPage = i;
    }
    if (fileData[i].trim().equals(headers.get(x + 1))) {
        lastLineOfPage = i;
    }
}
You could create an object (int pageStart, int pageStop), which would be the result of your method.
I hope it helped you :)
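The getPages() idea above could be sketched roughly as follows. This is not POI API: the PageRange type and the method are hypothetical, and the sketch assumes that every page starts with a paragraph equal to its header and that headers appear in document order. It takes the paragraph array (for example from WordExtractor.getParagraphText()) as a plain String[].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HeaderPages {
    // Hypothetical value type: paragraphs [pageStart, pageStop) belong to one page.
    static final class PageRange {
        final int pageStart;
        final int pageStop;
        PageRange(int pageStart, int pageStop) {
            this.pageStart = pageStart;
            this.pageStop = pageStop;
        }
    }

    static List<PageRange> getPages(String[] paragraphs, List<String> headers) {
        List<PageRange> pages = new ArrayList<>();
        int next = 0;   // index of the next header we expect to see
        int start = -1; // first paragraph of the page currently open, -1 if none
        for (int i = 0; i < paragraphs.length; i++) {
            // trim() because extracted paragraphs may carry trailing line breaks
            if (next < headers.size() && paragraphs[i].trim().equals(headers.get(next))) {
                if (start >= 0) {
                    pages.add(new PageRange(start, i)); // close the previous page
                }
                start = i;
                next++;
            }
        }
        if (start >= 0) {
            pages.add(new PageRange(start, paragraphs.length)); // last page runs to the end
        }
        return pages;
    }

    public static void main(String[] args) {
        String[] data = {"header1\r\n", "a", "b", "header2", "c"};
        for (PageRange r : getPages(data, Arrays.asList("header1", "header2"))) {
            System.out.println(r.pageStart + ".." + r.pageStop);
        }
    }
}
```

Like the loop it is based on, this breaks down if a header string also appears in the body text.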
I have two PDF files (named A1.pdf and B1.pdf). Now I want to replace some pages of the second PDF file (B1.pdf) with pages from the first one (A1.pdf) programmatically. I am using the PDFBox library for this.
Here is my sample code:
try {
File file = new File("/Users/test/Desktop/A1.pdf");
PDDocument pdDoc = PDDocument.load(file);
PDDocument document = PDDocument.load(new File("/Users/test/Desktop/B1.pdf"));
document.removePage(3);
document.addPage((PDPage) pdDoc.getDocumentCatalog().getAllPages().get(0));
document.save("/Users/test/Desktop/"+"generatedPDFBox"+".pdf");
document.close();
}catch(Exception e){}
The idea is to replace the 3rd page. In this implementation the page is appended after the last page of the output PDF instead. Can anyone help me implement this? If it cannot be done with PDFBox, could you please suggest some other Java libraries?
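This solution creates a third PDF file with the contents you asked for. Note that page indexes are zero-based, so the "3" in your question must be a "2". (getAllPages() is the PDFBox 1.x API; in 2.x the pages are obtained through getPages() instead.)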
PDDocument a1doc = PDDocument.load(file1);
PDDocument b1doc = PDDocument.load(file2);
PDDocument resDoc = new PDDocument();
List<PDPage> a1Pages = a1doc.getDocumentCatalog().getAllPages();
List<PDPage> b1Pages = b1doc.getDocumentCatalog().getAllPages();
// replace the 3rd page of the 2nd file with the 1st page of the 1st one
for (int p = 0; p < b1Pages.size(); ++p)
{
if (p == 2)
resDoc.addPage(a1Pages.get(0));
else
resDoc.addPage(b1Pages.get(p));
}
resDoc.save(file3);
a1doc.close();
b1doc.close();
resDoc.close();
If you want to work from the command line instead, look here:
https://pdfbox.apache.org/commandline/
Then use PDFSplit and PDFMerge.
I am not too familiar with how PDFBox works, but to answer your follow up I know you can accomplish what you want to do in a fairly simple manner with the Datalogics APDFL SDK. A free trial exists in case you want to look into it. Here is a code snippet so you can see how it would be done:
Document Doc1 = new Document("/Users/test/Desktop/A1.pdf");
Document Doc2 = new Document("/Users/test/Desktop/B1.pdf");
/* Delete pages on the page range 3-3*/
Doc2.deletePages(3, 3);
/* LastPage is where in Doc2 you want to insert the page, Doc1 the document from which the page is coming from, 0 is the page number in Doc1 that will be inserted first, 1 is the number of pages that will be inserted (beginning from the page number specified in the previous parameter), and PageInsertFlags which would let you customize what gets / doesn't get copied */
Doc2.insertPages(Document.LastPage, Doc1, 0, 1, PageInsertFlags.All);
Doc2.save(EnumSet.of(SaveFlags.FULL), "out.pdf");
Alternatively, there is another method called replacePages which makes the deletion unnecessary. It all depends on what your end goal is, of course.