I'm using PDFBox to replace text in a template, and there are characters (e.g. a simple 'J') that the tool treats as special characters. Any help with solving this problem?
public static PDDocument replaceText(PDDocument document, Map<String, String> mapVars) throws IOException {
    for (PDPage page : document.getPages()) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<?> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                String pstring = "";
                int prej = 0;
                if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }
                    if (mapVars.containsKey(pstring.trim())) {
                        String replacement = mapVars.get(pstring.trim());
                        COSString sx = new COSString(replacement);
                        sx.setValue(replacement.getBytes());
                        previous.clear();
                        previous.add(0, sx);
                    }
                }
            }
        }
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }
    return document;
}
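For completeness, this is a minimal sketch of how the method would be invoked (file names and map keys are placeholders, not from my actual template):
// Hypothetical driver for replaceText(); paths and keys are placeholders.
Map<String, String> mapVars = new HashMap<>();
mapVars.put("{{NAME}}", "John Doe");
try (PDDocument document = PDDocument.load(new File("template.pdf"))) {
    replaceText(document, mapVars);
    document.save("template-filled.pdf");
}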
I have a PDF file with colored text whose color I need to remove (i.e. set it to black). I couldn't find much help anywhere, so I dug in and figured it out with the help of this post: PDFBox 2.0 RC3 -- Find and replace text
As there isn't much about this, I suspect that few people care; still, I thought I'd share.
private void setTextBlack(PDDocument pdDocument) throws IOException {
    for (PDPage pdPage : pdDocument.getPages()) {
        PDFStreamParser parser = new PDFStreamParser(pdPage);
        parser.parse();
        java.util.List tokens = parser.getTokens();
        for (int i = 0; i < tokens.size(); i++) {
            Object next = tokens.get(i);
            if (next instanceof Operator && ((Operator) next).getName().equals("BT")) {
                for (int j = i + 1; j < tokens.size(); j++) {
                    Object btToken = tokens.get(j);
                    if (btToken instanceof Operator && ((Operator) btToken).getName().equals("rg")) {
                        int n = j - 1;
                        while (tokens.get(n) instanceof COSInteger || tokens.get(n) instanceof COSFloat) {
                            tokens.set(n, new COSFloat(0f));
                            n--;
                        }
                    }
                    if (btToken instanceof Operator && ((Operator) btToken).getName().equals("ET")) {
                        break;
                    }
                }
            }
        }
        PDStream updatedStream = new PDStream(pdDocument);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        pdPage.setContents(updatedStream);
        out.close();
    }
}
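In case it helps, a minimal sketch of how it can be invoked (file names are placeholders):
// Hypothetical driver for setTextBlack(); input/output paths are placeholders.
try (PDDocument pdDocument = PDDocument.load(new File("colored.pdf"))) {
    setTextBlack(pdDocument);
    pdDocument.save("black-text.pdf");
}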
I have 4 PDF files that came from one .doc file; I used 4 methods to convert the doc to PDF (Foxit Reader, Nitro, a web service, and Word).
Then I used PDFBox to search and replace some words. The problem is that, for some reason, it only works for the files from Foxit Reader and Word, but not for the files created by Nitro and the web service.
Does anyone have a clue?
This is the code I used:
public static void replace(String s) {
    PDDocument doc = null;
    int occurrences = 0;
    try {
        doc = PDDocument.load(s); // Input PDF File Name
        System.out.println("+e" + doc);
        List pages = doc.getDocumentCatalog().getAllPages();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = (PDPage) pages.get(i);
            // System.out.println("ddd");
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();
            List tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                // System.out.println("jjjj");
                Object next = tokens.get(j);
                if (next instanceof PDFOperator) {
                    PDFOperator op = (PDFOperator) next;
                    // Tj and TJ are the two operators that display strings in a PDF
                    if (op.getOperation().equals("Tj")) {
                        // Tj takes one operand, and that is the string to display,
                        // so update that operand
                        COSString previous = (COSString) tokens.get(j - 1);
                        String string = previous.getString();
                        if (string.contains("#signature#")) {
                            string = string.replace("#signature#", "sam");
                            occurrences++;
                        }
                        previous.reset();
                        previous.append(string.getBytes("ISO-8859-1"));
                    } else if (op.getOperation().equals("TJ")) {
                        COSArray previous = (COSArray) tokens.get(j - 1);
                        COSString temp = new COSString();
                        String tempString = "";
                        for (int t = 0; t < previous.size(); t++) {
                            if (previous.get(t) instanceof COSString) {
                                tempString += ((COSString) previous.get(t)).getString();
                            }
                        }
                        temp.append(tempString.getBytes("ISO-8859-1"));
                        tempString = temp.getString();
                        if (tempString.contains("#signature#")) {
                            tempString = tempString.replace("#signature#", "sam");
                            occurrences++;
                        }
                        previous.clear();
                        String[] stringArray = tempString.split(" ");
                        for (String string : stringArray) {
                            COSString cosString = new COSString();
                            string = string + " ";
                            cosString.append(string.getBytes("ISO-8859-1"));
                            previous.add(cosString);
                        }
                    }
                }
            }
            // now that the tokens are updated, replace the page content stream
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            page.setContents(updatedStream);
        }
        System.out.println("number of matches found: " + occurrences);
        doc.save(s + "_convert.pdf"); // Output file name
    } catch (Exception ex) {
        System.out.println("eee+" + ex.getMessage());
    } finally {
        if (doc != null) {
            try {
                doc.close();
            } catch (IOException ex) {
                ex.getStackTrace();
            }
        }
    }
}
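One thing I still plan to try is dumping the raw content-stream tokens of each converter's output and comparing them; if, say, the Nitro or web-service files emit the text as per-glyph strings, the contains() check above would never see the whole placeholder. A rough sketch (untested, file name is a placeholder):
// Hypothetical debugging sketch: print every token of the first page's content stream
// so the outputs of the different converters can be compared side by side.
PDDocument doc = PDDocument.load("nitro.pdf"); // placeholder file name
PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(0);
PDFStreamParser parser = new PDFStreamParser(page.getContents().getStream());
parser.parse();
for (Object token : parser.getTokens()) {
    System.out.println(token); // COSString, COSArray, PDFOperator, ...
}
doc.close();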
I'm trying to replace text in a PDF, and it does get replaced, sort of. This is my code:
PDDocument doc = null;
int occurrences = 0;
try {
    doc = PDDocument.load("test.pdf"); // Input PDF File Name
    List pages = doc.getDocumentCatalog().getAllPages();
    for (int i = 0; i < pages.size(); i++) {
        PDPage page = (PDPage) pages.get(i);
        PDStream contents = page.getContents();
        PDFStreamParser parser = new PDFStreamParser(contents.getStream());
        parser.parse();
        List tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof PDFOperator) {
                PDFOperator op = (PDFOperator) next;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getOperation().equals("Tj")) {
                    // Tj takes one operand, and that is the string to display,
                    // so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    if (string.contains("Good")) {
                        string = string.replace("Good", "Bad");
                        occurrences++;
                    }
                    // Word you want to change. Currently this code changes the word "Good" to "Bad".
                    previous.reset();
                    previous.append(string.getBytes("ISO-8859-1"));
                } else if (op.getOperation().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    COSString temp = new COSString();
                    String tempString = "";
                    for (int t = 0; t < previous.size(); t++) {
                        if (previous.get(t) instanceof COSString) {
                            tempString += ((COSString) previous.get(t)).getString();
                        }
                    }
                    temp.append(tempString.getBytes("ISO-8859-1"));
                    tempString = temp.getString();
                    if (tempString.contains("Good")) {
                        tempString = tempString.replace("Good", "Bad");
                        occurrences++;
                    }
                    previous.clear();
                    String[] stringArray = tempString.split(" ");
                    for (String string : stringArray) {
                        COSString cosString = new COSString();
                        string = string + " ";
                        cosString.append(string.getBytes("ISO-8859-1"));
                        previous.add(cosString);
                    }
                }
            }
        }
        // now that the tokens are updated, replace the page content stream
        PDStream updatedStream = new PDStream(doc);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        page.setContents(updatedStream);
    }
    System.out.println("number of matches found: " + occurrences);
    doc.save("a.pdf"); // Output file name
} catch (IOException ex) {
    Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} catch (COSVisitorException ex) {
    Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
} finally {
    if (doc != null) {
        try {
            doc.close();
        } catch (IOException ex) {
            Logger.getLogger(ReplaceTextInPDF.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
The issue is that the text is replaced with bad characters or a hidden shape (for example, the word "Bad" ends up as only the character "d"), but if I copy and paste it somewhere else, the expected word is pasted correctly.
Also, when I search the generated PDF for the new word it is not found, but when I search for the old word it is found in the replaced places.
I found Aspose; this link shows how to use it to replace text in PDFs. It's easy and works perfectly, except that it's not free, so the free version prints a copyright line at the top of the PDF pages:
http://www.aspose.com/docs/display/pdfjava/Replace+Text+in+Pages+of+a+PDF+Document
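For reference, the replacement code on that page boils down to roughly the following (my paraphrase of the linked Aspose example, untested; file names and search text are placeholders, so check the page for the exact API):
// Rough sketch of the linked Aspose.PDF example; names are placeholders.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("input.pdf");
// Collect all fragments matching the search phrase on all pages
com.aspose.pdf.TextFragmentAbsorber absorber = new com.aspose.pdf.TextFragmentAbsorber("old text");
pdfDocument.getPages().accept(absorber);
// Replace the text of every matched fragment
for (com.aspose.pdf.TextFragment fragment : absorber.getTextFragments()) {
    fragment.setText("new text");
}
pdfDocument.save("output.pdf");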
After reading endless documents and trying to understand the examples about OpenCV/JavaCV for extracting keypoints and computing features with DescriptorExtractors, in order to match an input image against a bunch of images and see whether the input image is one of them (or part of one of them), I think we should be storing the Mat objects after computing them.
I will use Emily Webb's code as an example:
String smallUrl = "rsz_our-mobile-planet-us-infographic_infographics_lg_unberela.jpg";
String largeUrl = "our-mobile-planet-us-infographic_infographics_lg.jpg";
IplImage image = cvLoadImage(largeUrl,CV_LOAD_IMAGE_UNCHANGED );
IplImage image2 = cvLoadImage(smallUrl,CV_LOAD_IMAGE_UNCHANGED );
CvMat descriptorsA = new CvMat(null);
CvMat descriptorsB = new CvMat(null);
final FastFeatureDetector ffd = new FastFeatureDetector(40, true);
final KeyPoint keyPoints = new KeyPoint();
final KeyPoint keyPoints2 = new KeyPoint();
ffd.detect(image, keyPoints, null);
ffd.detect(image2, keyPoints2, null);
System.out.println("keyPoints.size() : "+keyPoints.size());
System.out.println("keyPoints2.size() : "+keyPoints2.size());
// BRISK extractor = new BRISK();
//BriefDescriptorExtractor extractor = new BriefDescriptorExtractor();
FREAK extractor = new FREAK();
extractor.compute(image, keyPoints, descriptorsA);
extractor.compute(image2, keyPoints2, descriptorsB);
System.out.println("descriptorsA.size() : "+descriptorsA.size());
System.out.println("descriptorsB.size() : "+descriptorsB.size());
DMatch dmatch = new DMatch();
//FlannBasedMatcher matcher = new FlannBasedMatcher();
//DescriptorMatcher matcher = new DescriptorMatcher();
BFMatcher matcher = new BFMatcher();
matcher.match(descriptorsA, descriptorsB, dmatch, null);
System.out.println(dmatch.capacity());
My question is: how can I store descriptorsA (or descriptorsB) in a DB, in the Java implementation of OpenCV? (They are Mat objects obtained after extractor.compute(image, keyPoints, descriptorsA);)
I am aware that Mat objects are not serializable in the Java implementation, but surely, if you want to match an image against a set of archive images, you have to extract the descriptors of your archive and store them somewhere for future use.
After some more searching I found some links in http://answers.opencv.org/question/8873/best-way-to-store-a-mat-object-in-android/
Although the answers are mainly for Android devices and refer to earlier questions about saving keypoints (Saving ORB feature vectors using OpenCV4Android (java API)), the "from Mat object to XML and XML to Mat object" approach in the code below seems to be working:
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;
import java.util.Locale;
import java.util.Scanner;
public class TaFileStorage {
// static
public static final int READ = 0;
public static final int WRITE = 1;
// variables
private File file;
private boolean isWrite;
private Document doc;
private Element rootElement;
public TaFileStorage() {
file = null;
isWrite = false;
doc = null;
rootElement = null;
}
// read or write
public void open(String filePath, int flags ) {
try {
if( flags == READ ) {
open(filePath);
}
else {
create(filePath);
}
} catch(Exception e) {
e.printStackTrace();
}
}
// read only
public void open(String filePath) {
try {
file = new File(filePath);
if( file == null || file.isFile() == false ) {
System.err.println("Can not open file: " + filePath );
}
else {
isWrite = false;
doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
doc.getDocumentElement().normalize();
}
} catch(Exception e) {
e.printStackTrace();
}
}
// write only
public void create(String filePath) {
try {
file = new File(filePath);
if( file == null ) {
System.err.println("Can not wrtie file: " + filePath );
}
else {
isWrite = true;
doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
rootElement = doc.createElement("opencv_storage");
doc.appendChild(rootElement);
}
} catch(Exception e) {
e.printStackTrace();
}
}
public Mat readMat(String tag) {
if( isWrite ) {
System.err.println("Try read from file with write flags");
return null;
}
NodeList nodelist = doc.getElementsByTagName(tag);
Mat readMat = null;
for( int i = 0 ; i<nodelist.getLength() ; i++ ) {
Node node = nodelist.item(i);
if( node.getNodeType() == Node.ELEMENT_NODE ) {
Element element = (Element)node;
String type_id = element.getAttribute("type_id");
if( "opencv-matrix".equals(type_id) == false) {
System.out.println("Fault type_id ");
}
String rowsStr = element.getElementsByTagName("rows").item(0).getTextContent();
String colsStr = element.getElementsByTagName("cols").item(0).getTextContent();
String dtStr = element.getElementsByTagName("dt").item(0).getTextContent();
String dataStr = element.getElementsByTagName("data").item(0).getTextContent();
int rows = Integer.parseInt(rowsStr);
int cols = Integer.parseInt(colsStr);
int type = CvType.CV_8U;
Scanner s = new Scanner(dataStr);
s.useLocale(Locale.US);
if( "f".equals(dtStr) ) {
type = CvType.CV_32F;
readMat = new Mat( rows, cols, type );
float fs[] = new float[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
if( s.hasNextFloat() ) {
fs[0] = s.nextFloat();
}
else {
fs[0] = 0;
System.err.println("Unmatched number of float value at rows="+r + " cols="+c);
}
readMat.put(r, c, fs);
}
}
}
else if( "i".equals(dtStr) ) {
type = CvType.CV_32S;
readMat = new Mat( rows, cols, type );
int is[] = new int[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
if( s.hasNextInt() ) {
is[0] = s.nextInt();
}
else {
is[0] = 0;
System.err.println("Unmatched number of int value at rows="+r + " cols="+c);
}
readMat.put(r, c, is);
}
}
}
else if( "s".equals(dtStr) ) {
type = CvType.CV_16S;
readMat = new Mat( rows, cols, type );
short ss[] = new short[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
if( s.hasNextShort() ) {
ss[0] = s.nextShort();
}
else {
ss[0] = 0;
System.err.println("Unmatched number of int value at rows="+r + " cols="+c);
}
readMat.put(r, c, ss);
}
}
}
else if( "b".equals(dtStr) ) {
readMat = new Mat( rows, cols, type );
byte bs[] = new byte[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
if( s.hasNextByte() ) {
bs[0] = s.nextByte();
}
else {
bs[0] = 0;
System.err.println("Unmatched number of byte value at rows="+r + " cols="+c);
}
readMat.put(r, c, bs);
}
}
}
}
}
return readMat;
}
public void writeMat(String tag, Mat mat) {
try {
if( isWrite == false) {
System.err.println("Try write to file with no write flags");
return;
}
Element matrix = doc.createElement(tag);
matrix.setAttribute("type_id", "opencv-matrix");
rootElement.appendChild(matrix);
Element rows = doc.createElement("rows");
rows.appendChild( doc.createTextNode( String.valueOf(mat.rows()) ));
Element cols = doc.createElement("cols");
cols.appendChild( doc.createTextNode( String.valueOf(mat.cols()) ));
Element dt = doc.createElement("dt");
String dtStr;
int type = mat.type();
if(type == CvType.CV_32F ) { // type == CvType.CV_32FC1
dtStr = "f";
}
else if( type == CvType.CV_32S ) { // type == CvType.CV_32SC1
dtStr = "i";
}
else if( type == CvType.CV_16S ) { // type == CvType.CV_16SC1
dtStr = "s";
}
else if( type == CvType.CV_8U ){ // type == CvType.CV_8UC1
dtStr = "b";
}
else {
dtStr = "unknown";
}
dt.appendChild( doc.createTextNode( dtStr ));
Element data = doc.createElement("data");
String dataStr = dataStringBuilder( mat );
data.appendChild( doc.createTextNode( dataStr ));
// append all to matrix
matrix.appendChild( rows );
matrix.appendChild( cols );
matrix.appendChild( dt );
matrix.appendChild( data );
} catch(Exception e) {
e.printStackTrace();
}
}
private String dataStringBuilder(Mat mat) {
StringBuilder sb = new StringBuilder();
int rows = mat.rows();
int cols = mat.cols();
int type = mat.type();
if( type == CvType.CV_32F ) {
float fs[] = new float[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
mat.get(r, c, fs);
sb.append( String.valueOf(fs[0]));
sb.append( ' ' );
}
sb.append( '\n' );
}
}
else if( type == CvType.CV_32S ) {
int is[] = new int[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
mat.get(r, c, is);
sb.append( String.valueOf(is[0]));
sb.append( ' ' );
}
sb.append( '\n' );
}
}
else if( type == CvType.CV_16S ) {
short ss[] = new short[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
mat.get(r, c, ss);
sb.append( String.valueOf(ss[0]));
sb.append( ' ' );
}
sb.append( '\n' );
}
}
else if( type == CvType.CV_8U ) {
byte bs[] = new byte[1];
for( int r=0 ; r<rows ; r++ ) {
for( int c=0 ; c<cols ; c++ ) {
mat.get(r, c, bs);
sb.append( String.valueOf(bs[0]));
sb.append( ' ' );
}
sb.append( '\n' );
}
}
else {
sb.append("unknown type\n");
}
return sb.toString();
}
public void release() {
try {
if( isWrite == false) {
System.err.println("Try release of file with no write flags");
return;
}
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(file);
// write to xml file
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
// do it
transformer.transform(source, result);
} catch(Exception e) {
e.printStackTrace();
}
}
}
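A minimal sketch of how this class can be used to round-trip a descriptor matrix (tag and file names are placeholders, untested; descriptorsA is assumed to be an org.opencv.core.Mat of computed descriptors):
// Hypothetical round trip with TaFileStorage.
TaFileStorage writer = new TaFileStorage();
writer.open("descriptors.xml", TaFileStorage.WRITE);
writer.writeMat("descriptorsA", descriptorsA); // descriptorsA: Mat of descriptors
writer.release();                              // writes the XML to disk

TaFileStorage reader = new TaFileStorage();
reader.open("descriptors.xml", TaFileStorage.READ);
Mat restored = reader.readMat("descriptorsA");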
As the code proposed by Thorben was too slow in my case, I came up with the following code using serialization.
public final void saveMat(String path, Mat mat) {
    File file = new File(path).getAbsoluteFile();
    file.getParentFile().mkdirs();
    try {
        int cols = mat.cols();
        float[] data = new float[(int) mat.total() * mat.channels()];
        mat.get(0, 0, data);
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(path))) {
            oos.writeObject(cols);
            oos.writeObject(data);
        }
    } catch (IOException | ClassCastException ex) {
        System.err.println("ERROR: Could not save mat to file: " + path);
        Logger.getLogger(getClass().getName()).log(Level.SEVERE, null, ex);
    }
}

public final Mat loadMat(String path) {
    try {
        int cols;
        float[] data;
        try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(path))) {
            cols = (int) ois.readObject();
            data = (float[]) ois.readObject();
        }
        Mat mat = new Mat(data.length / cols, cols, CvType.CV_32F);
        mat.put(0, 0, data);
        return mat;
    } catch (IOException | ClassNotFoundException | ClassCastException ex) {
        System.err.println("ERROR: Could not load mat from file: " + path);
        Logger.getLogger(getClass().getName()).log(Level.SEVERE, null, ex);
    }
    return null;
}
For float descriptors (e.g. SIFT/SURF), OpenCV uses Mats of floats; in other cases you have to modify the code according to the following mapping (a CV_8U variant is sketched right after the list):
CV_8U and CV_8S -> byte[]
CV_16U and CV_16S -> short[]
CV_32S -> int[]
CV_32F -> float[]
CV_64F-> double[]
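For example, if the Mat is CV_8U (as it is for binary descriptors such as ORB, BRISK or FREAK), the save routine would look roughly like this, my untested adaptation of the code above, with only the array type changed:
// Untested adaptation of saveMat for CV_8U matrices; the structure is identical,
// only float[] becomes byte[].
public final void saveMatBytes(String path, Mat mat) {
    try {
        int cols = mat.cols();
        byte[] data = new byte[(int) mat.total() * mat.channels()];
        mat.get(0, 0, data); // valid because the Mat type is CV_8U
        try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(path))) {
            oos.writeObject(cols);
            oos.writeObject(data);
        }
    } catch (IOException ex) {
        System.err.println("ERROR: Could not save mat to file: " + path);
    }
}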
After searching through all of the answers, I edited some code and it seems to work. I use it to store SIFT descriptors in HBase.
public static byte[] serializeMat(Mat mat) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try {
        float[] data = new float[(int) mat.total() * mat.channels()];
        mat.get(0, 0, data);
        ObjectOutput out = new ObjectOutputStream(bos);
        out.writeObject(data);
        out.close();
        // Get the bytes of the serialized object
        byte[] buf = bos.toByteArray();
        return buf;
    } catch (IOException ioe) {
        ioe.printStackTrace();
        return null;
    }
}
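The read path isn't shown above; a rough counterpart (my own addition, untested) could rebuild the Mat like this, assuming SIFT descriptors, i.e. CV_32F data with 128 columns:
// Untested counterpart to serializeMat(); assumes CV_32F data with 128 columns (SIFT).
public static Mat deserializeMat(byte[] bytes) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
        float[] data = (float[]) in.readObject();
        Mat mat = new Mat(data.length / 128, 128, CvType.CV_32F);
        mat.put(0, 0, data);
        return mat;
    } catch (IOException | ClassNotFoundException e) {
        e.printStackTrace();
        return null;
    }
}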