Bad characters when replacing text in pdf using pdfbox

I'm trying to replace text in a PDF, and the replacement only partly works. This is my code:
PDDocument doc = null;
int occurrences = 0;
try {
doc = PDDocument.load("test.pdf"); //Input PDF File Name
List pages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof PDFOperator) {
PDFOperator op = (PDFOperator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getOperation().equals("Tj")) {
// Tj takes one operand, the string to display,
// so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
if (string.contains("Good")) {
string = string.replace("Good", "Bad");
occurrences++;
}
//Word you want to change. Currently this code changes word "Good" to "Bad"
previous.reset();
previous.append(string.getBytes("ISO-8859-1"));
} else if (op.getOperation().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
COSString temp = new COSString();
String tempString = "";
for (int t = 0; t < previous.size(); t++) {
if (previous.get(t) instanceof COSString) {
tempString += ((COSString) previous.get(t)).getString();
}
}
temp.append(tempString.getBytes("ISO-8859-1"));
tempString = "";
tempString = temp.getString();
if (tempString.contains("Good")) {
tempString = tempString.replace("Good", "Bad");
occurrences++;
}
previous.clear();
String[] stringArray = tempString.split(" ");
for (String string : stringArray) {
COSString cosString = new COSString();
string = string + " ";
cosString.append(string.getBytes("ISO-8859-1"));
previous.add(cosString);
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
System.out.println("number of matches found: " + occurrences);
doc.save("a.pdf"); //Output file name
} catch (IOException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} catch (COSVisitorException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
The issue is that the replacement text is rendered with bad or hidden characters (for example, the word "Bad" shows up as only the character "d"), but if I copy it and paste it somewhere else, the expected word is pasted correctly.
Also, when I search the generated PDF for the new word, it isn't found, but when I search for the old word, it is found at the replaced places.

I found Aspose; this link shows how to use it to replace text in PDFs. It's easy and works perfectly, except that it's not free: the free version prints a copyright line at the top of each page of the PDF file.
http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document
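
One possible explanation for the symptom in the question, offered as an assumption rather than something this thread confirms: PDFs commonly embed only a subset of a font, containing just the glyphs the original text used. If the text drawn with that font only ever used "G", "o" and "d", a replacement containing "B" and "a" has no glyphs to draw, which would match "Bad" showing up as only "d". The sketch below is plain Java (no PDFBox calls); the class and method names are hypothetical, and the "available" characters are approximated by the characters of the original text.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class GlyphCoverageCheck {

    /**
     * Returns the characters of the replacement that never occur in the text
     * already drawn with the font. With a subset font (assumption), these are
     * the characters likely to render as missing or wrong glyphs.
     */
    public static Set<Character> missingCharacters(String textUsingFont, String replacement) {
        Set<Character> available = new LinkedHashSet<>();
        for (char c : textUsingFont.toCharArray()) {
            available.add(c);
        }
        Set<Character> missing = new LinkedHashSet<>();
        for (char c : replacement.toCharArray()) {
            if (!available.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "Good" supplies G, o, d; "Bad" needs B, a, d
        System.out.println(missingCharacters("Good", "Bad")); // prints [B, a]
    }
}
```

If the set is non-empty, the replacement probably needs a font that contains those glyphs rather than the original embedded one.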

Related

How to keep the order of words while reading a right to left text pdf

I am trying to parse text from a PDF file in a right-to-left language using Java (code below).
Sometimes, because it's a right-to-left language, the order of the words changes after I try to break the text into lines.
For example:
טלפון: טלפון1 דואר:דואר1
became:
דואר1 : דואר טלפון1 טלפון:
public void test(){
PDFParser parser = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
PDFTextStripper pdfStripper;
String parsedText = "";
try {
parser = new PDFParser(new RandomAccessFile(new File(file1), "r"));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdfStripper.setSortByPosition(true);
//separator
pdfStripper.setWordSeparator(" ");
pdDoc = new PDDocument(cosDoc);
//get count of pages
int pages = pdDoc.getPages().getCount();
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(1);
parsedText = parsedText + pdfStripper.getText(pdDoc);
if(pages>1){
//
}
StringTokenizer lines = new StringTokenizer(parsedText, "\n");
return lines.getTokenList();
} catch (){
}
}

Try a simple word-order inversion:
public String invert(String s){
String arr[] = s.split(" ");
int len = arr.length;
for (int i = 0; i < len / 2; i++) {
String temp = arr[i];
arr[i] = arr[len - i - 1];
arr[len - i - 1] = temp;
}
return Arrays.stream(arr)
.collect(Collectors.joining(" "));
}
Usage example:
System.out.println(invert("1 2 3 4 5"));
Result:
5 4 3 2 1
Also, you should consider other delimiters (newline, tab, comma...).
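
As a rough sketch of the "other delimiters" point, the variant below reverses the words of a single line while keeping each delimiter run (spaces, tabs, commas) in place between the reversed words. The choice of delimiters and the class name are assumptions for illustration; it is meant to be applied one line at a time so that the line order itself is not reversed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordOrderInverter {

    // A token is either a run of delimiters (space, tab, comma) or a run of anything else.
    private static final Pattern TOKEN = Pattern.compile("[ \\t,]+|[^ \\t,]+");

    /** Reverses the order of words and delimiter runs within one line. */
    public static String invert(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        Collections.reverse(tokens);
        return String.join("", tokens);
    }

    public static void main(String[] args) {
        System.out.println(invert("1 2 3 4 5")); // prints 5 4 3 2 1
    }
}
```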

replace a text using pdfbox for PDF file

I have 4 PDF files that came from one .doc file; I used 4 methods to convert the doc to PDF (Foxit Reader, Nitro, a web service and Word).
Then I used PDFBox to search for and replace some words. The problem is that, for some reason, it only works for the files from Foxit Reader and Word, but not for the files created by Nitro and the web service.
Does anyone have a clue?
This is the code I used:
public static void replace(String s) {
PDDocument doc = null;
int occurrences = 0;
try {
doc = PDDocument.load(s); // Input PDF File Name
System.out.println("+e" + doc);
List pages = doc.getDocumentCatalog()
.getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = (PDPage) pages.get(i);
// System.out.println("ddd");
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
// System.out.println("jjjj");
Object next = tokens.get(j);
if (next instanceof PDFOperator) {
PDFOperator op = (PDFOperator) next;
// Tj and TJ are the two operators that display strings in a PDF
if (op.getOperation()
.equals("Tj")) {
// Tj takes one operand, the string to display,
// so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
if (string.contains("#signature#")) {
string = string.replace("#signature#", "sam");
occurrences++;
}
// Text you want to change.
// Currently this code replaces "#signature#" with "sam"
previous.reset();
previous.append(string.getBytes("ISO-8859-1"));
} else if (op.getOperation()
.equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
COSString temp = new COSString();
String tempString = "";
for (int t = 0; t < previous.size(); t++) {
if (previous.get(t) instanceof COSString) {
tempString += ((COSString) previous.get(t)).getString();
}
}
temp.append(tempString.getBytes("ISO-8859-1"));
tempString = "";
tempString = temp.getString();
if (tempString.contains("#signature#")) {
tempString = tempString.replace("#signature#", "sam");
occurrences++;
}
previous.clear();
String[] stringArray = tempString.split(" ");
for (String string : stringArray) {
COSString cosString = new COSString();
string = string + " ";
cosString.append(string.getBytes("ISO-8859-1"));
previous.add(cosString);
}
}
}
}
// now that the tokens are updated we will replace the page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
System.out.println("number of matches found: " + occurrences);
doc.save(s + "_convert.pdf"); // Output file name
} catch (Exception ex) {
System.out.println("eee+" + ex.getMessage());
} finally {
if (doc != null) {
try {
doc.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}

Unable to open file using File f = new File(file_pth);

I am trying to open a file by dragging and dropping it onto a JTextField, but I always get an error.
Here's my code:
public void drop(DropTargetDropEvent dtde) {
String str4=null;
try {
JTextArea comp = null;
if(Switchtab==2)
comp=textarea1;
if(Switchtab==3)
comp=textarea2;
if(Switchtab==4)
comp=textarea3;
if(Switchtab==1)
comp=textarea4;
// Ok, get the dropped object and try to figure out what it is
Transferable tr = dtde.getTransferable();
DataFlavor[] flavors = tr.getTransferDataFlavors();
for (int i = 0; i < flavors.length; i++) {
System.out.println("Possible flavor: "
+ flavors[i].getMimeType());
// Check for file lists specifically
if (flavors[i].isFlavorJavaFileListType()) {
// Great! Accept copy drops...
dtde.acceptDrop(DnDConstants.ACTION_COPY);
// comp.setText("Successful file list drop.\n\n");
// And add the list of file names to our text area
java.util.List list = (java.util.List) tr
.getTransferData(flavors[i]);
for (int j = 0; j < list.size(); j++) {
//wcomp.append(list.get(j) + "\n");
str4=list.get(j)+"\n";
}
// Replace '\' with '/'
file_pth = str4.replaceAll("\\\\","/" );
System.out.println(str4.replaceAll("\\\\","/" ));
//Open the file
try {
File f = new File(file_pth);
FileInputStream fobj = new FileInputStream(f);
int len = (int) f.length();
str4 = "";
for (int j = 0; j < len; j++) {
char str5 = (char) fobj.read();
str4 = str4 + str5;
}
comp.setText(str4);
setTitle(str4);
} catch (Exception e) {
System.out.println("Caught::" + e);
}
// If we made it this far, everything worked.
dtde.dropComplete(true);
return;
}
}
// Hmm, the user must not have dropped a file list
System.out.println("Drop failed: " + dtde);
dtde.rejectDrop();
} catch (Exception e) {
e.printStackTrace();
dtde.rejectDrop();
}
}
I even tried replacing each backslash with a double backslash and with a forward slash, but I still get this error:
Possible flavor: application/x-java-file-list; class=java.util.List
C:/kevin_java/file io/DemoIO.java
Caught::java.io.FileNotFoundException: C:\kevin_java\file io\DemoIO.java
(The filename, directory name, or volume label syntax is incorrect)
The error message doesn't show the replaced string.
It shows the original string with single backslashes.

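I finally found the answer. The "\n" appended to the path in the loop made the file name invalid; taking the list element's toString() without it fixes the problem.
Simple solution: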
java.util.List list = (java.util.List) tr
.getTransferData(flavors[i]);
for (int j = 0; j < list.size(); j++) {
str4=list.get(j).toString();
}
File f = new File(str4);
FileInputStream fobj = new FileInputStream(f);
...
...
..

Edit
From the Javadoc for isFlavorJavaFileListType:
Returns true if the DataFlavor specified represents a list of file objects.
Therefore, the list elements are File objects and can be used directly:
FileInputStream fobj = new FileInputStream((File) list.get(list.size() - 1));
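
A minimal sketch of that idea, assuming the dropped list really contains java.io.File objects and that the file is UTF-8 text (both assumptions; the class and method names are hypothetical). It reads the last dropped file in one call instead of building the string character by character:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.List;

public class DroppedFileReader {

    /** Reads the last file of a dropped file list as UTF-8 text. */
    public static String readLastDroppedFile(List<?> droppedItems) throws IOException {
        if (droppedItems.isEmpty()) {
            throw new IllegalArgumentException("no files were dropped");
        }
        // No string manipulation of the path is needed: the element is already a File.
        File file = (File) droppedItems.get(droppedItems.size() - 1);
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    }
}
```

In the drop handler, the result could be passed to comp.setText(...) in place of the manual read loop.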

How to replace centered text in a PDF with PDFBox

I use the PDFTextReplacement example.
It does the replacement as expected when my text is left-aligned.
But if the input PDF has centered text, the replacement text ends up left-aligned.
So I have to recalculate the correct starting point.
That leaves me with two questions:
How do I determine the alignment?
How do I calculate the right starting point?
Here is my code:
public PDDocument doIt(String inputFile, Map<String, String> text)
throws IOException, COSVisitorException {
// the document
PDDocument doc = null;
doc = PDDocument.load(inputFile);
List pages = doc.getDocumentCatalog().getAllPages();
for (int i = 0; i < pages.size(); i++) {
PDPage page = (PDPage) pages.get(i);
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream());
parser.parse();
List tokens = parser.getTokens();
for (int j = 0; j < tokens.size(); j++) {
Object next = tokens.get(j);
if (next instanceof PDFOperator) {
PDFOperator op = (PDFOperator) next;
// Tj and TJ are the two operators that display
// strings in a PDF
String pstring = "";
int prej = 0;
if (op.getOperation().equals("Tj")) {
// Tj takes one operand, the string to display,
// so let's update that operand
COSString previous = (COSString) tokens.get(j - 1);
String string = previous.getString();
// System.out.println(j + " " + string);
if (j == prej) {
pstring += string;
} else {
prej = j;
pstring = string;
}
previous.reset();
previous.append(string.getBytes("ISO-8859-1"));
} else if (op.getOperation().equals("TJ")) {
COSArray previous = (COSArray) tokens.get(j - 1);
for (int k = 0; k < previous.size(); k++) {
Object arrElement = previous.getObject(k);
if (arrElement instanceof COSString) {
COSString cosString = (COSString) arrElement;
String string = cosString.getString();
if (j == prej) {
pstring += string;
} else {
prej = j;
pstring = string;
}
cosString.reset();
// cosString.append(string
// .getBytes("ISO-8859-1"));
}
}
COSString cosString2 = (COSString) previous
.getObject(0);
for (int t = 1; t < previous.size(); t++)
previous.remove(t);
// cosString2.setNeedToBeUpdate(true);
if (text.containsKey(pstring.trim())) {
String textValue = text.get(pstring.trim());
cosString2.append(textValue.getBytes("ISO-8859-1"));
for (int k = 1; k < previous.size(); k++) {
previous.remove(k);
}
}
}
}
}
// now that the tokens are updated we will replace the
// page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens(tokens);
page.setContents(updatedStream);
}
return doc;
}

You can use this function:
public void doIt( String inputFile, String outputFile, String strToFind, String message)
throws IOException, COSVisitorException
{
// the document
PDDocument doc = null;
try
{
doc = PDDocument.load( inputFile );
List pages = doc.getDocumentCatalog().getAllPages();
for( int i=0; i<pages.size(); i++ )
{
PDPage page = (PDPage)pages.get( i );
PDStream contents = page.getContents();
PDFStreamParser parser = new PDFStreamParser(contents.getStream() );
parser.parse();
List tokens = parser.getTokens();
for( int j=0; j<tokens.size(); j++ )
{
Object next = tokens.get( j );
if( next instanceof PDFOperator )
{
PDFOperator op = (PDFOperator)next;
//Tj and TJ are the two operators that display
//strings in a PDF
if( op.getOperation().equals( "Tj" ) )
{
//Tj takes one operand, the string to display,
//so let's update that operand
COSString previous = (COSString)tokens.get( j-1 );
String string = previous.getString();
string = string.replaceFirst( strToFind, message );
previous.reset();
previous.append( string.getBytes() );
}
else if( op.getOperation().equals( "TJ" ) )
{
COSArray previous = (COSArray)tokens.get( j-1 );
for( int k=0; k<previous.size(); k++ )
{
Object arrElement = previous.getObject( k );
if( arrElement instanceof COSString )
{
COSString cosString = (COSString)arrElement;
String string = cosString.getString();
string = string.replaceFirst( strToFind, message );
cosString.reset();
cosString.append( string.getBytes() );
}
}
}
}
}
//now that the tokens are updated we will replace the
//page content stream.
PDStream updatedStream = new PDStream(doc);
OutputStream out = updatedStream.createOutputStream();
ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
tokenWriter.writeTokens( tokens );
page.setContents( updatedStream );
}
doc.save( outputFile );
}
finally
{
if( doc != null )
{
doc.close();
}
}
}
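
The function above replaces the strings but does not reposition them. A PDF content stream stores only positions, not an alignment, so whether a line was centered has to be inferred; the sketch below simply assumes it was. Under that assumption, the new start is the old center minus half the new width. PDFBox's PDFont.getStringWidth reports widths in 1/1000 of a text-space unit, so they are scaled by the font size first. The class and method names here are hypothetical, and only the arithmetic is shown:

```java
public class CenteredTextMath {

    /** Converts a width in font units (1/1000 of a text-space unit) to text-space units. */
    public static float textWidth(float stringWidthInFontUnits, float fontSize) {
        return stringWidthInFontUnits / 1000f * fontSize;
    }

    /**
     * Returns the x position at which the new text must start so that it keeps
     * the center of the old text (assumes the old text was centered).
     */
    public static float centeredStartX(float oldStartX, float oldWidth, float newWidth) {
        float center = oldStartX + oldWidth / 2f;
        return center - newWidth / 2f;
    }

    public static void main(String[] args) {
        float oldWidth = textWidth(5000f, 12f); // 60.0
        float newWidth = textWidth(3000f, 12f); // 36.0
        System.out.println(centeredStartX(100f, oldWidth, newWidth)); // prints 112.0
    }
}
```

The resulting x would then have to replace the horizontal operand of the text-positioning operator that precedes the Tj or TJ in the token list.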

How to read bookmarks in PDF using itext at multi level?

I am using iText for Java to split PDFs at bookmark level.
Does anybody know of, or have, any examples for splitting a PDF at bookmarks that exist at level 2 or 3?
For example, I have bookmarks at the following levels:
Father
|-Son
|-Son
|-Daughter
|-|-Grand son
|-|-Grand daughter
Right now I have the code below, which reads only the top-level bookmark (Father). Basically, the SimpleBookmark.getBookmark(reader) line does all the work.
But I want to read the level 2 and level 3 bookmarks as well, to split out the content between those inner-level bookmarks.
public static void splitPDFByBookmarks(String pdf, String outputFolder){
try
{
PdfReader reader = new PdfReader(pdf);
//List of bookmarks: each bookmark is a map with values for title, page, etc
List<HashMap> bookmarks = SimpleBookmark.getBookmark(reader);
for(int i=0; i<bookmarks.size(); i++){
HashMap bm = bookmarks.get(i);
HashMap nextBM = i==bookmarks.size()-1 ? null : bookmarks.get(i+1);
//In my case I needed to split the title string
String title = ((String)bm.get("Title")).split(" ")[2];
log.debug("Titel: " + title);
String startPage = ((String)bm.get("Page")).split(" ")[0];
String startPageNextBM = nextBM==null ? "" + (reader.getNumberOfPages() + 1) : ((String)nextBM.get("Page")).split(" ")[0];
log.debug("Page: " + startPage);
log.debug("------------------");
extractBookmarkToPDF(reader, Integer.valueOf(startPage), Integer.valueOf(startPageNextBM), title + ".pdf",outputFolder);
}
}
catch (IOException e)
{
log.error(e.getMessage());
}
}
private static void extractBookmarkToPDF(PdfReader reader, int pageFrom, int pageTo, String outputName, String outputFolder){
Document document = new Document();
OutputStream os = null;
try{
os = new FileOutputStream(outputFolder + outputName);
// Create a writer for the outputstream
PdfWriter writer = PdfWriter.getInstance(document, os);
document.open();
PdfContentByte cb = writer.getDirectContent(); // Holds the PDF data
PdfImportedPage page;
while(pageFrom < pageTo) {
document.newPage();
page = writer.getImportedPage(reader, pageFrom);
cb.addTemplate(page, 0, 0);
pageFrom++;
}
os.flush();
document.close();
os.close();
}catch(Exception ex){
log.error(ex.getMessage());
}finally {
if (document.isOpen())
document.close();
try {
if (os != null)
os.close();
} catch (IOException ioe) {
log.error(ioe.getMessage());
}
}
}
Your help is much appreciated.
Thanks in advance! :)

You get an ArrayList<HashMap> when you call SimpleBookmark.getBookmark(reader) (cast it if you need to). Try iterating through that ArrayList to see its structure. If a bookmark has children ("sons", as you call them), its map contains another list with the same structure.
A recursive method could be the solution.
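
A minimal sketch of such a recursive walk, written against plain java.util maps so it runs without iText. It assumes the nested list is stored under the key "Kids", which is how iText 5's SimpleBookmark nests child bookmarks as far as I know; the class name and the output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BookmarkWalker {

    /**
     * Appends one "level:title@page" entry per bookmark, depth first.
     * Child bookmarks are expected under the "Kids" key (assumption).
     */
    @SuppressWarnings("unchecked")
    public static void collect(List<? extends Map<String, Object>> bookmarks,
                               int level, List<String> out) {
        if (bookmarks == null) {
            return;
        }
        for (Map<String, Object> bm : bookmarks) {
            out.add(level + ":" + bm.get("Title") + "@" + bm.get("Page"));
            Object kids = bm.get("Kids");
            if (kids instanceof List) {
                // Recurse into the children, one level deeper
                collect((List<Map<String, Object>>) kids, level + 1, out);
            }
        }
    }

    /** Convenience wrapper that starts at level 1. */
    public static List<String> flatten(List<? extends Map<String, Object>> bookmarks) {
        List<String> out = new ArrayList<>();
        collect(bookmarks, 1, out);
        return out;
    }
}
```

With the flattened list in hand, the page ranges between consecutive entries can be computed the same way the question's code does for the top level.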

For reference, for those who are doing this with iText 7:
public void walkOutlines(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument,List<String>titles,List<Integer>pageNum) { //----------loop traversing all paths
for (PdfOutline child : outline.getAllChildren()){
if(child.getDestination() != null) {
prepareIndexFile(child,names,pdfDocument,titles,pageNum);
}
}
}
//-----Getting pageNumbers from outlines
public void prepareIndexFile(PdfOutline outline, Map<String, PdfObject> names, PdfDocument pdfDocument,List<String>titles,List<Integer>pageNum) {
String title = outline.getTitle();
PdfDestination pdfDestination = outline.getDestination();
String pdfStr = ((PdfString)pdfDestination.getPdfObject()).toUnicodeString();
PdfArray array = (PdfArray) names.get(pdfStr);
PdfObject pdfObj = array != null ? array.get(0) : null;
Integer pageNumber = pdfDocument.getPageNumber((PdfDictionary)pdfObj);
titles.add(title);
pageNum.add(pageNumber);
if(outline.getAllChildren().size() > 0) {
for (PdfOutline child : outline.getAllChildren()){
prepareIndexFile(child,names,pdfDocument,titles,pageNum);
}
}
}
public boolean splitPdf(String inputFile, final String outputFolder) {
boolean splitSuccess = true;
PdfDocument pdfDoc = null;
try {
PdfReader pdfReaderNew = new PdfReader(inputFile);
pdfDoc = new PdfDocument(pdfReaderNew);
final List<String> titles = new ArrayList<String>();
List<Integer> pageNum = new ArrayList<Integer>();
PdfNameTree destsTree = pdfDoc.getCatalog().getNameTree(PdfName.Dests);
Map<String, PdfObject> names = destsTree.getNames();//--------------------------------------Core logic for getting names
PdfOutline root = pdfDoc.getOutlines(false);//--------------------------------------Core logic for getting outlines
walkOutlines(root,names, pdfDoc, titles, pageNum); //------Logic to get bookmarks and pageNumbers
if (titles == null || titles.size()==0) {
splitSuccess = false;
}else { //------Proceed if it has bookmarks
for(int i=0;i<titles.size();i++) {
String title = titles.get(i);
String startPageNmStr =""+pageNum.get(i);
int startPage = Integer.parseInt(startPageNmStr);
int endPage = startPage;
if(i == titles.size() - 1) {
endPage = pdfDoc.getNumberOfPages();
}else {
int nextPage = pageNum.get(i+1);
if(nextPage > startPage) {
endPage = nextPage - 1;
}else {
endPage = nextPage;
}
}
String outFileName = outputFolder + File.separator + getFileName(title) + ".pdf";
PdfWriter pdfWriter = new PdfWriter(outFileName);
PdfDocument newDocument = new PdfDocument(pdfWriter, new DocumentProperties().setEventCountingMetaInfo(null));
pdfDoc.copyPagesTo(startPage, endPage, newDocument);
newDocument.close();
pdfWriter.close();
}
}
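}catch(Exception e){
//---log
splitSuccess = false;
}finally {
// release the source document whether or not the split succeeded
if (pdfDoc != null) {
pdfDoc.close();
}
}
return splitSuccess;
}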
