I am evaluating PDFBox 1.8.6 and iText 4.2.0 (free) for performance. I have noticed that content extraction is faster with iText, but searching for words with a regex in the content extracted by iText takes longer than in the content extracted by PDFBox.
The extracted content and its size seem to be the same in both cases.
Can anybody explain this?
My environment is Ubuntu 12.04 / Java 1.7.
EDIT: adding sample code.
PDFBox
//in constructor
fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
pdfDocument = PDDocument.load(bis);
textStripper = new PDFTextStripper();
//in extractContent method
textStripper.setStartPage(pageNo);
textStripper.setEndPage(pageNo);
return textStripper.getText(pdfDocument);
iText
//in constructor
fis = new FileInputStream(file);
reader = new PdfReader(fis);
pdfTextExtractor = new PdfTextExtractor(reader);
//in extractContent method
return pdfTextExtractor.getTextFromPage(pageNo);
Searching for words
StopWatch searchTime = ...
StopWatch pdfTime = ...
for (int i = 1; i <= pageCount; i++) {
// fetch one page
pdfTime.resume();
String pageContent = pdfParserService.extractPageContent(i);
pdfTime.suspend();
if (pageContent == null) {
pageContent = "";
}
pageContent = pageContent.replace("-\n", "").replace("\n", "").replace("\\", " ");
searchTime.resume();
Collection list = searchService.search(pageContent, wordList);
searchTime.suspend();
}
pdfTime.stop();
searchTime.stop();
System.out.println(pdfTime.getTime());
System.out.println(searchTime.getTime());
Related
I am writing a Java program for translating PDFs, using the Google Translate API. The translation prints correctly on my Eclipse IDE console, but when I check the newly created PDF, either it is not translated and copied as is, only a few words are translated, or the new PDF comes out empty and sometimes corrupted.
I suppose it has something to do with encoding and font types.
I have already gone through the iText page and all the related questions, but none worked for my case. I am trying to translate Portuguese, Spanish, Finnish, French, Hungarian, etc. into English.
Here is my code:
public static final String SRC = "5587309Finnish.pdf";
public static final String DEST = "changed.pdf";
public static void main(String[] args) throws java.io.IOException, DocumentException {
Translate translate = TranslateOptions.getDefaultInstance().getService();
PdfReader reader = new PdfReader(SRC);
int pages = reader.getNumberOfPages();
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
for(int i=1;i<=pages;i++) {
PdfDictionary dict = reader.getPageN(i);
PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
if (object instanceof PRStream) {
String pageContent =
PdfTextExtractor.getTextFromPage(reader, i);
String[] word = pageContent.split(" ");
PRStream stream = (PRStream) object;
byte[] data = PdfReader.getStreamBytes(stream);
String dd = new String(data, BaseFont.CP1252);
for (int j=0; j < word.length; j++)
{
Translation translation = translate.translate(word[j],Translate.TranslateOption.sourceLanguage("fi"),
Translate.TranslateOption.targetLanguage("en"));
System.out.println(word[j]+"-->>"+translation.getTranslatedText()); // here I can check that the translation is correct
dd = dd.replace(word[j],translation.getTranslatedText());
}
stream.setData(dd.getBytes());
}
}
stamper.close();
reader.close();
}
Please help.
According to a comment you have improved your code and are
getting the updated dd (i.e. the content stream which I am printing) correctly with the replaced text. I don't know why I am getting a blank pdf
Thus, I assume that your (hopefully representative) test PDFs have all their fonts of interest encoded in ANSI'ish encodings and the text arguments of the text drawing instructions contain whole words or even phrases which can properly be processed because otherwise text replacement would not have been possible.
Thus, here is an example of how one can replace text pieces with similarly long ones under such benign circumstances without breaking the content stream syntax. In this example I simply use a Map containing replacement strings. You can do your translation there.
First a frame loading the source, creating a stamper, iterating over the pages, and calling a helper to create a content stream replacement:
Map<String, String> replacements = new HashMap<>();
replacements.put("Förfallodatum", "Ablaufdatum");
try ( InputStream resource = SOURCE_INPUTSTREAM;
OutputStream result = new FileOutputStream(RESULT_FILE) ) {
PdfReader pdfReader = new PdfReader(resource);
PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
for (int pageNum = 1; pageNum <= pdfReader.getNumberOfPages(); pageNum++) {
PdfDictionary page = pdfReader.getPageN(pageNum);
byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(pdfReader, pageNum);
page.remove(PdfName.CONTENTS);
replaceInStringArguments(pageContentInput, pdfStamper.getUnderContent(pageNum), replacements);
}
pdfStamper.close();
}
(EditPageContentSimple test testReplaceInStringArgumentsForklaringAvFakturan)
The method replaceInStringArguments now parses the instructions in the given content stream, isolates string arguments, and calls another helper for each string argument doing the replacement.
void replaceInStringArguments(byte[] contentBytesBefore, PdfContentByte canvas, Map<String, String> replacements) throws IOException {
PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytesBefore)));
PdfContentParser ps = new PdfContentParser(tokeniser);
ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
while (ps.parse(operands).size() > 0){
for (int i = 0; i < operands.size(); i++) {
PdfObject pdfObject = operands.get(i);
if (pdfObject instanceof PdfString) {
operands.set(i, replaceInString((PdfString)pdfObject, replacements));
} else if (pdfObject instanceof PdfArray) {
PdfArray pdfArray = (PdfArray) pdfObject;
for (int j = 0; j < pdfArray.size(); j++) {
PdfObject arrayObject = pdfArray.getPdfObject(j);
if (arrayObject instanceof PdfString) {
pdfArray.set(j, replaceInString((PdfString)arrayObject, replacements));
}
}
}
}
for (PdfObject object : operands)
{
object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
canvas.getInternalBuffer().append((byte) ' ');
}
canvas.getInternalBuffer().append((byte) '\n');
}
}
(EditPageContentSimple helper method)
The method replaceInString in turn retrieves a single string operand (a PdfString instance), manipulates it, and returns the manipulated string version:
PdfString replaceInString(PdfString string, Map<String, String> replacements) {
String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
for (Map.Entry<String, String> entry : replacements.entrySet()) {
value = value.replace(entry.getKey(), entry.getValue());
}
return new PdfString(PdfEncodings.convertToBytes(value, PdfObject.TEXT_PDFDOCENCODING));
}
(EditPageContentSimple helper method)
Instead of that for loop, you would call your translation routine here and translate value.
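For illustration only, a minimal sketch of such a variant, reusing the Translate service from your own code; translateString is a hypothetical helper, and note that a translated value may contain characters that PDFDocEncoding cannot represent, in which case this naive approach breaks:
PdfString translateString(PdfString string, Translate translate) {
    // Sketch only: replaces the map lookup with a call to the Google Translate
    // service from the question; the encoding handling is unchanged and thus
    // still limited to PDFDocEncoding-representable text.
    String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
    Translation translation = translate.translate(value,
            Translate.TranslateOption.sourceLanguage("fi"),
            Translate.TranslateOption.targetLanguage("en"));
    value = translation.getTranslatedText();
    return new PdfString(PdfEncodings.convertToBytes(value, PdfObject.TEXT_PDFDOCENCODING));
}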
As has been mentioned before, this code only works under certain benign circumstances. Don't expect it to work for arbitrary documents from the wild, in particular not for documents with other than Western European glyphs.
URL stringfile = getXsl("test.xml");
File originFile = new File(stringfile.getFile());
String xml = null;
ByteArrayOutputStream pdfStream = null;
try {
FileInputStream fis = new FileInputStream(originFile);
int length = fis.available();
byte[] readData = new byte[length];
fis.read(readData);
xml = (new String(readData)).trim();
fis.close();
xml = xml.substring(xml.lastIndexOf("<HttpCommandList>")+17, xml.lastIndexOf("</HttpCommandList>"));
String[] splitxml = xml.split("</HttpCommand>");
for (int i = 0; i < splitxml.length; i++) {
tmpxml = splitxml[i].trim() + "</HttpCommand>";
System.out.println("splitxml:" +tmpxml);
pdfStream = new ByteArrayOutputStream();
pdf = new com.lowagie.text.Document();
PdfWriter.getInstance(pdf, pdfStream);
pdf.open();
URL xslToUse = getXsl("test.xsl");
// Here am using some utility class to transform
// generate the XML needed by iText to generate the PDF using MessageBuffer contents
String iTextXml = XmlUtil.transformXml(tmpxml.toString(), xslToUse).trim();
// generate the PDF document by parsing the specified XML file
XmlParser.parse(pdf, new ByteArrayInputStream(iTextXml.getBytes()));
}
With the above code, during XmlParser.parse I am getting java.net.MalformedURLException: no protocol.
I am trying to generate the PDF document by parsing the specified XML file.
We would need the actual XML file to decide what is missing. I expect that there is no protocol defined, just like this:
192.168.1.2/ (no protocol)
file://192.168.1.2/ (there is one)
And URL seems to need one.
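A quick check, not iText-specific, illustrates the difference (the address is only an example):
import java.net.URL;
public class UrlProtocolCheck {
    public static void main(String[] args) throws Exception {
        System.out.println(new URL("file://192.168.1.2/")); // fine, a protocol is present
        System.out.println(new URL("192.168.1.2/"));        // throws MalformedURLException: no protocol
    }
}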
Also try:
new File("somexsl.xlt").toURI().toURL();
See here and here.
It always helps to post the complete stack trace. No one knows where the exception actually occurred if you don't post the line numbers.
I am using XDocReport to generate a PDF from a DOCX file given as input. Everything is fine when I use an English DOCX file, but when I use a DOCX file in my other language I cannot get a readable PDF.
Here is my code:
File fil = new File(
"/home/madurauser/analyzer/LOS/DocxProjectWithVelocity1.docx");
FileInputStream in = new FileInputStream(fil);
IXDocReport report = XDocReportRegistry.getRegistry().loadReport(
in, TemplateEngineKind.Velocity);
FieldsMetadata metadata = new FieldsMetadata();
metadata.addFieldAsList("developers.Inst");
metadata.addFieldAsList("developers.MBalance");
metadata.addFieldAsList("developers.MDemand");
metadata.addFieldAsList("developers.MInterest");
metadata.addFieldAsList("developers.MPrincipal");
metadata.addFieldAsList("developers.GBalance");
metadata.addFieldAsList("developers.GDemand");
metadata.addFieldAsList("developers.GInterest");
metadata.addFieldAsList("developers.GPrincipal");
metadata.addFieldAsList("developers.Members");
metadata.addFieldAsList("developers.Month");
report.setFieldsMetadata(metadata);
IContext context = report.createContext();
List<Developer> developers = new ArrayList<Developer>();
List<LoanRepaymentSchedule> repay = this.loanService
.getLoanRepaymentScheduleById(groupLoan.getLoanId()
.longValue());
LoanRepaymentSchedule rep = repay.get(repay.size() - 1);
Project project = new Project(lt, loan.getGroupName(),
lastFiveDigitsAccNo, groupDto.getVillageName(),
groupDto.getCluster(), groupDto.getClusterCentre(),
groupDto.getRegion(), intLoanAmount, loan.getLoanAccNo(),
Long.valueOf(loan.getLoanInstallments()),
loan.getGroupId(), decIntRate, loan.getAnimator(),
loan.getRep1(), loan.getRep2(), noOfDays, brokenPeriod,
sanctionDate, lastFiveDigitsAccNo, strSancDate,
rep.getMemberCount());
context.put("project", project);
for (Iterator iterator = repay.iterator(); iterator.hasNext();) {
LoanRepaymentSchedule loanRepaymentSchedule = (LoanRepaymentSchedule) iterator
.next();
String month;
Integer year = loanRepaymentSchedule.getYear();
Integer formattedDate = year % 100;
developers.add(new Developer(intgBal, intgDem, intgInt,
intgPri, intmBal, intmDem, intmInt, intmPri, month,
loanRepaymentSchedule.getMemberCount(),
loanRepaymentSchedule.getMemberCount(),
loanRepaymentSchedule.getSerialNo()));
context.put("developers", developers);
}
// OutputStream out = new FileOutputStream(new File(conv+".pdf"));
OutputStream out = new FileOutputStream(new File(files + "_" + groupID
+ ".pdf"));
Options options = Options.getTo(ConverterTypeTo.PDF).via(
ConverterTypeVia.XWPF);
report.convert(context, options, out);
This is my Tamil-font DOCX, which I gave as input.
The generated output looks like the following.
Any ideas would be appreciated.
I'm trying to create an auto-filled PDF of a government payroll form, which involves the possibility of a variable number of pages. I'm currently storing each page as a Map, with the keys being the names of the fields and the values being their contents.
At the moment, I have this code:
in = new FileInputStream(inputPDF);
PdfCopyFields adder = new PdfCopyFields(outStream);
PdfReader reader = null;
PdfStamper stamper = null;
ByteArrayOutputStream baos = null;
for (int pageNum = 0; pageNum < numPages; pageNum++) {
reader = new PdfReader(in);
baos = new ByteArrayOutputStream();
stamper = new PdfStamper(reader, baos);
AcroFields form = stamper.getAcroFields();
Map<String, String> page = pages.get(pageNum);
setFieldsToPage(form, pageNum);
populatePage(form, page, pageNum);
stamper.close();
reader = new PdfReader(baos.toByteArray());
adder.addDocument(reader);
}
The methods called are:
private void populatePage(AcroFields form, Map<String, String> pageMap, int pageNum) throws Exception {
ArrayList<String> fieldNames = new ArrayList<String>();
for (String key : pageMap.keySet()) {
fieldNames.add(key);
}
for (String key : fieldNames) {
form.setField(key + pageNum, pageMap.get(key));
}
}
and
private void setFieldsToPage(AcroFields form, int pageNum) {
ArrayList<String> fieldNames = new ArrayList<String>();
Map<String, AcroFields.Item> fields = form.getFields();
for (String fieldName : fields.keySet()) {
fieldNames.add(fieldName);
}
for (String fieldName : fieldNames) {
form.renameField(fieldName, fieldName + pageNum);
}
}
The issue is that this throws an exception on the second iteration through the loop: at reader = new PdfReader(in) I get the following exception:
java.io.IOException: PDF header signature not found.
What am I doing wrong here, and how do I fix it?
EDIT:
Here is the exception:
java.io.IOException: PDF header signature not found.
at com.lowagie.text.pdf.PRTokeniser.checkPdfHeader(Unknown Source)
at com.lowagie.text.pdf.PdfReader.readPdf(Unknown Source)
at com.lowagie.text.pdf.PdfReader.<init>(Unknown Source)
at com.lowagie.text.pdf.PdfReader.<init>(Unknown Source)
By the way, I'm sorry if the formatting is bad; this is my first time using Stack Overflow.
Your issue is that you essentially try to read the same input stream multiple times while it is positioned at its end already after the first time:
in = new FileInputStream(inputPDF);
[...]
for (int pageNum = 0; pageNum < numPages; pageNum++) {
reader = new PdfReader(in);
[...]
}
The whole stream is read in the first iteration; thus, in the second one new PdfReader(in) essentially tries to parse an empty file resulting in your
java.io.IOException: PDF header signature not found
You can fix that by simply constructing the PdfReader with the input file path directly every time:
for (int pageNum = 0; pageNum < numPages; pageNum++) {
reader = new PdfReader(inputPDF);
[...]
}
Two more things, though:
You don't close your PdfReader instances after use. In the most recent iText versions, implicit closing of readers has been taken out of the code, as it collides with numerous use cases. Thus, after you have finished working with a reader (this includes that any stamper etc. using that reader is also closed), you should close the reader explicitly.
In general, if you already have a PDF in your file system, opening a PdfReader for it via a FileInputStream is very wasteful resource-wise: a reader initialized with an input stream first completely reads that stream into memory (byte[]) and then parses the in-memory representation, while a reader initialized with a file path parses the on-disk representation directly. Both points are illustrated in the sketch below.
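A minimal sketch of the loop with both points applied, keeping the names from the question and assuming inputPDF is the path of the form and adder is the PdfCopyFields instance:
for (int pageNum = 0; pageNum < numPages; pageNum++) {
    // open from the file path instead of reusing a single FileInputStream
    PdfReader reader = new PdfReader(inputPDF);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfStamper stamper = new PdfStamper(reader, baos);
    AcroFields form = stamper.getAcroFields();
    setFieldsToPage(form, pageNum);
    populatePage(form, pages.get(pageNum), pageNum);
    stamper.close();
    reader.close(); // close explicitly once the stamper using it has been closed
    adder.addDocument(new PdfReader(baos.toByteArray()));
}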
The exception tells you that the file you're reading doesn't start with %PDF-.
Write a small example that doesn't involve iText and check the first 5 bytes of the InputStream in and you'll find out what you're doing wrong (we can't tell you unless you show us those 5 bytes).
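For instance, something along these lines (the file name is only a placeholder):
import java.io.FileInputStream;
import java.io.InputStream;
public class HeaderCheck {
    public static void main(String[] args) throws Exception {
        // a well-formed PDF starts with the five ASCII bytes %PDF-
        try (InputStream in = new FileInputStream("input.pdf")) {
            byte[] header = new byte[5];
            int read = in.read(header);
            System.out.println(new String(header, 0, Math.max(read, 0), "US-ASCII"));
        }
    }
}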
I'm trying to open an MS Word 2003 document in Java, search for a specified string and replace it with a new string. I use Apache POI to do that. My code looks like the following:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with the following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works successfully, but when I use a more complicated file it reads the content successfully and generates the Test1.doc file; however, when I try to open that file, Word gives me the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.
First of all, you should be closing your document.
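For instance, a minimal sketch of writing and then closing the output stream, assuming the rest of your method stays as it is:
// write the document and make sure the stream is closed,
// e.g. with try-with-resources (Java 7+)
try (BufferedOutputStream out =
        new BufferedOutputStream(new FileOutputStream(outputFilename))) {
    document.write(out);
}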
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .xml to .doc. Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
There is not much documentation about POI at all, especially for Word document unfortunately.
I don't know if it's OK to answer myself, but just to share the knowledge, I'll answer my own question.
After searching the web, the final solution I found is:
The library docx4j is very good for dealing with MS DOCX files. Its documentation is not sufficient yet and its forum is still in its beginning steps, but overall it helped me do what I needed.
Thanks to all who helped me.
You could try the OpenOffice API, but there aren't many resources out there to tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.