To read a PDF file, I have used the code segment below, which works with the iText library. However, for some PDF documents it throws the exception shown below the code. I do not understand why this exception is thrown for some documents but not for others. Moreover, how can I solve this problem?
NOTE: The code below extracts text from a PDF, i.e. it is a PDF-to-TXT converter.
private ArrayList<byte[]> contentOfPdf() {
    PdfReader reader = null;
    PdfDictionary dictionary = null;
    PRIndirectReference reference = null;
    PRStream contentStream = null;
    ArrayList<byte[]> byteStream = new ArrayList<byte[]>();
    try {
        reader = new PdfReader(this.filename);
        // Note: iText page numbers are 1-based, so the loop starts at 1.
        for (int currentPage = 1; currentPage <= this.totalPageNumber; currentPage++) {
            dictionary = reader.getPageN(currentPage);
            reference = (PRIndirectReference) dictionary.get(PdfName.CONTENTS);
            /*line 166*/ contentStream = (PRStream) PdfReader.getPdfObject(reference);
            byteStream.add(PdfReader.getStreamBytes(contentStream));
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        reader.close();
    }
    return byteStream;
}
Exception:
java.lang.ClassCastException: com.itextpdf.text.pdf.PdfArray cannot be cast to com.itextpdf.text.pdf.PRStream
at pdfCrawler.retrieveContentOfPdf(CrawlerTask.java:166)
at pdfCrawler.call(CrawlerTask.java:55)
at pdfCrawler..call(CrawlerTask.java:1)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Whenever you manually walk a PDF, I'd strongly recommend having a copy of the PDF spec nearby and looking up each and every key. In your case, if you look up the CONTENTS key, you'll see that it says:
The value shall be either a single stream or an array of streams.
I'm not a Java guy, but the C# code below should be easy to convert to Java and should do what you're looking for:
//Will hold an array of references
PdfArray refs = null;
//If we have an array, use it directly
if (dictionary.Get(PdfName.CONTENTS).IsArray()) {
    refs = dictionary.GetAsArray(PdfName.CONTENTS);
//If we have just a reference, wrap it in a single-item array for convenience
} else if (dictionary.Get(PdfName.CONTENTS).IsIndirect()) {
    refs = new PdfArray(dictionary.Get(PdfName.CONTENTS));
//Sanity check, should never happen for conforming PDFs
} else {
    throw new ApplicationException("Unknown CONTENTS type");
}

//Loop through each reference
foreach (var r in refs) {
    //Same code as before
    reference = (PRIndirectReference)r;
    contentStream = (PRStream)PdfReader.GetPdfObject(reference);
    byteStream.Add(PdfReader.GetStreamBytes(contentStream));
}
Related
I created a Java program for translating PDFs, using the Google Translate API. I get the correct translation on my Eclipse IDE console, but when I check the newly created PDF, either it is not translated and is copied as-is, or only a few words are translated, or the new PDF comes out empty and sometimes corrupted.
I suppose it has something to do with encoding and font types.
I have already gone through the iText page and all the related questions, but none worked for my case. I am trying to translate Portuguese, Spanish, Finnish, French, Hungarian, etc. into English.
Here is my code:
public static final String SRC = "5587309Finnish.pdf";
public static final String DEST = "changed.pdf";

public static void main(String[] args) throws java.io.IOException, DocumentException {
    Translate translate = TranslateOptions.getDefaultInstance().getService();
    PdfReader reader = new PdfReader(SRC);
    int pages = reader.getNumberOfPages();
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
    for (int i = 1; i <= pages; i++) {
        PdfDictionary dict = reader.getPageN(i);
        PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
        if (object instanceof PRStream) {
            String pageContent = PdfTextExtractor.getTextFromPage(reader, i);
            String[] word = pageContent.split(" ");
            PRStream stream = (PRStream) object;
            byte[] data = PdfReader.getStreamBytes(stream);
            String dd = new String(data, BaseFont.CP1252);
            for (int j = 0; j < word.length; j++) {
                Translation translation = translate.translate(word[j],
                        Translate.TranslateOption.sourceLanguage("fi"),
                        Translate.TranslateOption.targetLanguage("en"));
                System.out.println(word[j] + "-->>" + translation.getTranslatedText()); //here i can check the translation is correct.
                dd = dd.replace(word[j], translation.getTranslatedText());
            }
            stream.setData(dd.getBytes());
        }
    }
    stamper.close();
    reader.close();
}
Please help.
According to a comment, you have improved your code and are
getting the update dd(i.e. content stream which I am printing) correctly with the replaced text. I don't know why I am getting a blank pdf
Thus, I assume that your (hopefully representative) test PDFs have all their fonts of interest encoded in ANSI-like encodings, and that the text arguments of the text-drawing instructions contain whole words or even phrases which can be processed properly, because otherwise text replacement would not have been possible.
Thus, here is an example of how one can replace text pieces with similarly long ones under such benign circumstances without breaking the content stream syntax. In this example I simply use a Map containing replacement strings; you can do your translation there instead.
First a frame loading the source, creating a stamper, iterating over the pages, and calling a helper to create a content stream replacement:
Map<String, String> replacements = new HashMap<>();
replacements.put("Förfallodatum", "Ablaufdatum");

try (InputStream resource = SOURCE_INPUTSTREAM;
        OutputStream result = new FileOutputStream(RESULT_FILE)) {
    PdfReader pdfReader = new PdfReader(resource);
    PdfStamper pdfStamper = new PdfStamper(pdfReader, result);
    for (int pageNum = 1; pageNum <= pdfReader.getNumberOfPages(); pageNum++) {
        PdfDictionary page = pdfReader.getPageN(pageNum);
        byte[] pageContentInput = ContentByteUtils.getContentBytesForPage(pdfReader, pageNum);
        page.remove(PdfName.CONTENTS);
        replaceInStringArguments(pageContentInput, pdfStamper.getUnderContent(pageNum), replacements);
    }
    pdfStamper.close();
}
(EditPageContentSimple test testReplaceInStringArgumentsForklaringAvFakturan)
The method replaceInStringArguments now parses the instructions in the given content stream, isolates string arguments, and calls another helper for each string argument to do the replacement.
void replaceInStringArguments(byte[] contentBytesBefore, PdfContentByte canvas, Map<String, String> replacements) throws IOException {
    PRTokeniser tokeniser = new PRTokeniser(new RandomAccessFileOrArray(new RandomAccessSourceFactory().createSource(contentBytesBefore)));
    PdfContentParser ps = new PdfContentParser(tokeniser);
    ArrayList<PdfObject> operands = new ArrayList<PdfObject>();
    while (ps.parse(operands).size() > 0) {
        for (int i = 0; i < operands.size(); i++) {
            PdfObject pdfObject = operands.get(i);
            if (pdfObject instanceof PdfString) {
                operands.set(i, replaceInString((PdfString) pdfObject, replacements));
            } else if (pdfObject instanceof PdfArray) {
                PdfArray pdfArray = (PdfArray) pdfObject;
                for (int j = 0; j < pdfArray.size(); j++) {
                    PdfObject arrayObject = pdfArray.getPdfObject(j);
                    if (arrayObject instanceof PdfString) {
                        pdfArray.set(j, replaceInString((PdfString) arrayObject, replacements));
                    }
                }
            }
        }
        for (PdfObject object : operands) {
            object.toPdf(canvas.getPdfWriter(), canvas.getInternalBuffer());
            canvas.getInternalBuffer().append((byte) ' ');
        }
        canvas.getInternalBuffer().append((byte) '\n');
    }
}
(EditPageContentSimple helper method)
The method replaceInString in turn retrieves a single string operand (a PdfString instance), manipulates it, and returns the manipulated string version:
PdfString replaceInString(PdfString string, Map<String, String> replacements) {
    String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
    for (Map.Entry<String, String> entry : replacements.entrySet()) {
        value = value.replace(entry.getKey(), entry.getValue());
    }
    return new PdfString(PdfEncodings.convertToBytes(value, PdfObject.TEXT_PDFDOCENCODING));
}
(EditPageContentSimple helper method)
Instead of that for loop, you would call your translation routine here and translate value; a sketch of such a variant follows below.
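For illustration only, a minimal sketch of that variant; translateToEnglish is a hypothetical helper standing in for your translation call, not an iText or Google API method:
// Hypothetical variant of replaceInString that translates instead of
// looking up replacements in a Map. translateToEnglish is an assumed
// helper wrapping your translation service; it is not an iText method.
PdfString translateInString(PdfString string) {
    String value = PdfEncodings.convertToString(string.getBytes(), PdfObject.TEXT_PDFDOCENCODING);
    String translated = translateToEnglish(value);
    return new PdfString(PdfEncodings.convertToBytes(translated, PdfObject.TEXT_PDFDOCENCODING));
}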
As has been mentioned before, this code only works under certain benign circumstances. Don't expect it to work for arbitrary documents from the wild, in particular not for documents with other than Western European glyphs.
I'm using Lucene 5.1.0. After analyzing and indexing a document, I would like to get a list of all the indexed terms that belong to this specific document.
File[] files = FILES_TO_INDEX_DIRECTORY.listFiles();
for (File file : files) {
    Document document = new Document();
    Reader reader = new FileReader(file);
    document.add(new TextField("fieldname", reader));
    iwriter.addDocument(document);
}
iwriter.close();

IndexReader indexReader = DirectoryReader.open(directory);
int maxDoc = indexReader.maxDoc();
for (int i = 0; i < maxDoc; i++) {
    Document doc = indexReader.document(i);
    String[] terms = doc.getValues("fieldname");
}
The terms come back null. Is there a way to get the saved terms per document?
Here is sample code for the answer, using a TokenStream (note that a TextField constructed from a Reader is indexed but not stored, which is why getValues returns nothing):
TokenStream ts = analyzer.tokenStream("myfield", reader);
// The Analyzer class will construct the Tokenizer, TokenFilter(s), and CharFilter(s),
// and pass the resulting Reader to the Tokenizer.
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);

try {
    ts.reset(); // Resets this stream to the beginning. (Required)
    while (ts.incrementToken()) {
        // Use AttributeSource.reflectAsString(boolean)
        // for token stream debugging.
        System.out.println("token: " + ts.reflectAsString(true));
        String term = charTermAttribute.toString();
        System.out.println(term);
    }
    ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
    ts.close(); // Release resources associated with this stream.
}
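A minimal way to drive that snippet for one of the question's files might look like the sketch below; the StandardAnalyzer is an assumption here, you should use the same analyzer you indexed with:
// Sketch: StandardAnalyzer stands in for whatever analyzer was used at index time.
Analyzer analyzer = new StandardAnalyzer();
try (Reader reader = new FileReader(file)) {
    TokenStream ts = analyzer.tokenStream("fieldname", reader);
    CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(charTermAttribute.toString()); // one indexed term per token
    }
    ts.end();
    ts.close();
}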
I'm trying to create an auto-filled PDF of a government payroll form, which involves the possibility of a variable number of pages. I'm currently storing each page as a Map, with the keys being the names of the fields and the values being their contents.
At the moment, I have this code:
in = new FileInputStream(inputPDF);
PdfCopyFields adder = new PdfCopyFields(outStream);
PdfReader reader = null;
PdfStamper stamper = null;
ByteArrayOutputStream baos = null;
for (int pageNum = 0; pageNum < numPages; pageNum++) {
    reader = new PdfReader(in);
    baos = new ByteArrayOutputStream();
    stamper = new PdfStamper(reader, baos);
    AcroFields form = stamper.getAcroFields();
    Map<String, String> page = pages.get(pageNum);
    setFieldsToPage(form, pageNum);
    populatePage(form, page, pageNum);
    stamper.close();
    reader = new PdfReader(baos.toByteArray());
    adder.addDocument(reader);
}
The methods called are:
private void populatePage(AcroFields form, Map<String, String> pageMap, int pageNum) throws Exception {
    ArrayList<String> fieldNames = new ArrayList<String>();
    for (String key : pageMap.keySet()) {
        fieldNames.add(key);
    }
    for (String key : fieldNames) {
        form.setField(key + pageNum, pageMap.get(key));
    }
}
and
private void setFieldsToPage(AcroFields form, int pageNum) {
    ArrayList<String> fieldNames = new ArrayList<String>();
    Map<String, AcroFields.Item> fields = form.getFields();
    for (String fieldName : fields.keySet()) {
        fieldNames.add(fieldName);
    }
    for (String fieldName : fieldNames) {
        form.renameField(fieldName, fieldName + pageNum);
    }
}
The issue is that this throws an exception on the second iteration through the loop, at reader = new PdfReader(in). I get the following exception:
java.io.IOException: PDF header signature not found.
What am I doing wrong here, and how do I fix it?
EDIT:
Here is the exception:
java.io.IOException: PDF header signature not found.
at com.lowagie.text.pdf.PRTokeniser.checkPdfHeader(Unknown Source)
at com.lowagie.text.pdf.PdfReader.readPdf(Unknown Source)
at com.lowagie.text.pdf.PdfReader.<init>(Unknown Source)
at com.lowagie.text.pdf.PdfReader.<init>(Unknown Source)
By the way, I'm sorry if the formatting is bad; this is my first time using Stack Overflow.
Your issue is that you essentially try to read the same input stream multiple times while it is positioned at its end already after the first time:
in = new FileInputStream(inputPDF);
[...]
for (int pageNum = 0; pageNum < numPages; pageNum++) {
    reader = new PdfReader(in);
    [...]
}
The whole stream is read in the first iteration; thus, in the second one, new PdfReader(in) essentially tries to parse an empty file, resulting in your
java.io.IOException: PDF header signature not found
You can fix that by simply constructing the PdfReader with the input file path directly every time:
for (int pageNum = 0; pageNum < numPages; pageNum++) {
    reader = new PdfReader(inputPDF);
    [...]
}
Two more things, though:
You don't close your PdfReader instances after use. In the most recent iText versions, implicit closing of readers has been removed from the code because it collided with numerous use cases. Thus, after you have finished working with a reader (which includes closing any stamper etc. that uses that reader), you should close the reader explicitly, as sketched below.
In general, if you already have a PDF in your file system, opening a PdfReader for it via a FileInputStream is very wasteful in terms of resources: a reader initialized with an input stream first reads that stream completely into memory (a byte[]) and only then parses the in-memory representation, whereas a reader initialized with a file path parses the on-disk representation directly.
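Putting both points together, the loop might look roughly like the sketch below; this is illustrative only, assuming inputPDF holds the path of the source file:
for (int pageNum = 0; pageNum < numPages; pageNum++) {
    // Construct the reader from the file path each time instead of reusing one stream.
    PdfReader reader = new PdfReader(inputPDF);
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    PdfStamper stamper = new PdfStamper(reader, baos);
    // ... rename and fill the fields as before ...
    stamper.close();
    reader.close(); // close explicitly once the stamper using it is closed

    adder.addDocument(new PdfReader(baos.toByteArray()));
}
adder.close(); // also close the PdfCopyFields target when done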
The exception tells you that the file you're reading doesn't start with %PDF-.
Write a small example that doesn't involve iText, check the first 5 bytes of the InputStream in, and you'll find out what you're doing wrong (we can't tell you unless you show us those 5 bytes).
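For example, a minimal sketch of such a check; the file name is a placeholder:
import java.io.FileInputStream;
import java.io.InputStream;

public class HeaderCheck {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("input.pdf")) { // placeholder path
            byte[] header = new byte[5];
            int read = in.read(header);
            // A well-formed PDF starts with the five ASCII bytes "%PDF-".
            System.out.println("read " + read + " bytes: " + new String(header, "US-ASCII"));
        }
    }
}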
I am using the iText PdfTextExtractor to extract text from a PdfReader, where the PdfReader is created from a byte array:
byte[] pdfbytes = outputStream.toByteArray();
PdfReader reader = new PdfReader(pdfbytes);
int pagenumber = reader.getNumberOfPages();
PdfTextExtractor extractor = new PdfTextExtractor(reader);
for (int i = 1; i <= pagenumber; i++) {
    System.out.println("============PAGE NUMBER " + i + "=============");
    String line = extractor.getTextFromPage(i);
    System.out.println(line);
}
The first test PDF is from http://www.gnostice.com/downloads/Gnostice_PathQuest.pdf.
I can print out the first page, but I get the following exception on the second page.
Exception:
Exception in thread "main" ExceptionConverter: java.io.IOException: Error reading string at file pointer 238291
at com.lowagie.text.pdf.PRTokeniser.throwError(Unknown Source)
at com.lowagie.text.pdf.PRTokeniser.nextToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.nextValidToken(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.readPRObject(Unknown Source)
at com.lowagie.text.pdf.PdfContentParser.parse(Unknown Source)
at com.lowagie.text.pdf.parser.PdfContentStreamProcessor.processContent(Unknown Source)
at com.lowagie.text.pdf.parser.PdfTextExtractor.getTextFromPage(Unknown Source)
at org.xxx.services.pdfparser.xxxExtensionPdfParser.main(xxxExtensionPdfParser.java:114)
where xxxExtensionPdfParser.java:114 is String line = extractor.getTextFromPage(i);
But in a second test with http://www.irs.gov/pub/irs-pdf/fw4.pdf, I can get the text content without an exception. So I think it must be some format issue in the first PDF that causes the exception.
So my question is: what is this format issue, and is there any way to avoid it? Thanks.
I am getting the same error, and upon some investigation it seems that the problem with my PDF documents is that they contain a 'header' or 'footer', as opposed to the IRS document you've linked. I indexed a 900-page PDF document, and about 70 of the pages fail to extract. Apparently, all these pages have footer copyright information. Any ideas how to resolve this issue?
EDIT:
I applied the following method to get the text out of the aforementioned PDF. Hope this works for you as well.
PdfReader pdfReader = new PdfReader(file);
PdfReaderContentParser parser = new PdfReaderContentParser(pdfReader);
// Process each page with a simple text extraction strategy.
for (int currentPage = 1; currentPage <= pdfReader.getNumberOfPages(); currentPage++) {
    TextExtractionStrategy strategy = parser.processContent(currentPage, new SimpleTextExtractionStrategy());
    String content = strategy.getResultantText();
    System.out.println(content);
}
byte[] pdfbytes = outputStream.toByteArray();
PdfReader reader = new PdfReader(pdfbytes);
int pagenumber = reader.getNumberOfPages();
for (int i = 1; i <= pagenumber; i++) {
    System.out.println("============PAGE NUMBER " + i + "=============");
    String line = PdfTextExtractor.getTextFromPage(reader, i);
    System.out.println(line);
}
Replace your code with this and it will work fine (getTextFromPage is called statically here, so the PdfTextExtractor instance is no longer needed).
I've been trying to work out how to obtain the travel time between two locations (walking, driving, etc.).
As I understand it, the only way to do this accurately is by retrieving a KML file from Google, then parsing it.
Research suggests it then needs to be parsed with SAX. The problem is, I can't seem to work out how to extract the correct variables (the time). Does anybody know if / how this can be done?
Many thanks for your help,
Pete.
Parsing XML (which is what KML basically is) using a SAX parser: http://www.dreamincode.net/forums/blog/324/entry-2683-parsing-xml-in-java-part-1-sax/
<kml>
  <Document>
    <Placemark>
      <name>Route</name>
      <description>Distance: 1.4 mi (about 30 mins)<br/>Map data ©2011 Tele Atlas</description>
    </Placemark>
  </Document>
</kml>
In the example you can see that the estimated time is stored in the "description" tag. It is saved in the last "Placemark" tag in the KML file, the one that has a <name>Route</name> tag.
Getting this tag with the SAX parser and extracting the time using a regex should be easily done; a sketch follows below.
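For illustration, a minimal sketch of such a handler against the standard javax.xml.parsers / org.xml.sax APIs; the class name and the inline test KML are placeholders, not part of the original answer:
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class RouteTimeHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inPlacemark;
    private String name;
    private String description;
    private String travelTime; // e.g. "30 mins"

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("Placemark".equals(qName)) {
            inPlacemark = true;
            name = null;
            description = null;
        }
        if ("name".equals(qName) || "description".equals(qName)) {
            text.setLength(0); // start collecting character data for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!inPlacemark) return;
        if ("name".equals(qName)) {
            name = text.toString().trim();
        } else if ("description".equals(qName)) {
            description = text.toString();
        } else if ("Placemark".equals(qName)) {
            inPlacemark = false;
            if ("Route".equals(name) && description != null) {
                // Pull "30 mins" out of "... (about 30 mins) ..."
                Matcher m = Pattern.compile("about\\s+([^)]+)\\)").matcher(description);
                if (m.find()) {
                    travelTime = m.group(1).trim();
                }
            }
        }
    }

    public String getTravelTime() {
        return travelTime;
    }

    public static void main(String[] args) throws Exception {
        String kml = "<kml><Document><Placemark><name>Route</name>"
                + "<description>Distance: 1.4 mi (about 30 mins)</description>"
                + "</Placemark></Document></kml>";
        RouteTimeHandler handler = new RouteTimeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(kml)), handler);
        System.out.println(handler.getTravelTime()); // expected: 30 mins
    }
}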
Here's my Jsoup implementation for getting tracks:
public ArrayList<ArrayList<LatLng>> getAllTracks() {
    ArrayList<ArrayList<LatLng>> allTracks = new ArrayList<ArrayList<LatLng>>();
    try {
        // Read the KML asset into a single string.
        StringBuilder buf = new StringBuilder();
        InputStream json = MyApplication.getInstance().getAssets().open("track.kml");
        BufferedReader in = new BufferedReader(new InputStreamReader(json));
        String str;
        while ((str = in.readLine()) != null) {
            buf.append(str);
        }
        in.close();
        String html = buf.toString();
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Collect the raw contents of every <coordinates> element.
        ArrayList<String> tracksString = new ArrayList<String>();
        for (Element e : doc.select("coordinates")) {
            tracksString.add(e.toString().replace("<coordinates>", "").replace("</coordinates>", ""));
        }

        for (int i = 0; i < tracksString.size(); i++) {
            ArrayList<LatLng> oneTrack = new ArrayList<LatLng>();
            ArrayList<String> oneTrackString = new ArrayList<String>(Arrays.asList(tracksString.get(i).split("\\s+")));
            // Starting at k = 1 skips the empty token produced by leading whitespace.
            for (int k = 1; k < oneTrackString.size(); k++) {
                // KML stores coordinates as "longitude,latitude[,altitude]",
                // so the latitude is the second component.
                String[] parts = oneTrackString.get(k).split(",");
                LatLng latLng = new LatLng(Double.parseDouble(parts[1]), Double.parseDouble(parts[0]));
                oneTrack.add(latLng);
            }
            allTracks.add(oneTrack);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return allTracks;
}