I am reading PDF documents via the iTextSharp library.
These documents are in Czech, which uses diacritics (ř ě ž š č etc.).
How can I read these characters? Any ideas? Or is there a way to replace them with plain r e z s c?
This is the code in my method. Thanks.
PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);
// we can inspect the syntax of the imported page
String text = new String();
for (int page = 1; page <= 1; page++) {
text += PdfTextExtractor.getTextFromPage(reader, page);
}
reader.close();
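On the second part of the question (replacing ř ě ž š č with plain letters), here is a minimal sketch that uses only the JDK and is independent of iText. The class name DiacriticsStripper and the method strip are hypothetical names chosen for illustration. It assumes the extracted text already arrived as correct Unicode; it cannot repair text that was extracted incorrectly.

```java
import java.text.Normalizer;

// Hypothetical helper; not part of iText or iTextSharp.
final class DiacriticsStripper {

    // Returns the input with combining marks (carons, acutes, rings) removed.
    static String strip(String input) {
        // NFD decomposes a letter such as U+0159 (r with caron) into "r" + U+030C.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks; dropping them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        // The literal reads "Skákal pes přes oves".
        System.out.println(strip("Sk\u00e1kal pes p\u0159es oves")); // prints "Skakal pes pres oves"
    }
}
```

In C# the same idea should be expressible with string.Normalize and a filter on the Unicode category of each character, but that variant is not shown here.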
I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF
The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF to a file encoded using UTF-8 (UNICODE):
public void parse(String filename) throws IOException {
PdfReader reader = new PdfReader(filename);
FileOutputStream fos = new FileOutputStream(DEST);
for (int page = 1; page <= 1; page++) {
fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
}
fos.flush();
fos.close();
}
The result is the file czech.txt:
As you can see from the screen shot, the text is extracted correctly (but make sure that the viewer you use knows that the file is encoded as UTF-8, otherwise you may see strange characters instead of the actual text).
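The encoding point can be checked without a PDF at all. The sketch below, which uses only the JDK, writes a string with Czech characters as UTF-8 and reads it back with the same charset. The class name Utf8TextFile and its methods are hypothetical. Reading the same bytes with a different charset is what produces the "strange characters" mentioned above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; it only illustrates explicit UTF-8 handling.
final class Utf8TextFile {

    // Encodes the text as UTF-8 and writes it to dest.
    static void write(Path dest, String text) throws IOException {
        Files.write(dest, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads the whole file and decodes it as UTF-8.
    static String read(Path src) throws IOException {
        return new String(Files.readAllBytes(src), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("czech", ".txt");
        String text = "p\u0159es zelenou louku"; // "přes zelenou louku"
        write(file, text);
        System.out.println(read(file).equals(text)); // prints "true"
        Files.delete(file);
    }
}
```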
Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE
Please share your PDF so that people on StackOverflow can check whether text extraction fails because of an error in your code or because the PDF itself doesn't allow the text to be extracted.
My goal is to insert a docx (keeping its styles and formatting) at a specific position in another docx. The second docx contains the word "placeholder"; I first have to find this word and then replace it with the content of the first docx, keeping the inserted document's styles and formatting.
I have an idea. Maybe I should create a new docx, split the second docx at the "placeholder", put the first part into the new docx, then the whole first docx, and then the remaining part of the second docx. But how can I keep the styles and formatting? I don't have images, tables, or anything like that, just text and formatting such as lists, tabs, text styles, etc.
Currently I am using Apache POI and Java. (I tried docx4j, but with less success.)
The example code below does a simple merge but nothing more. How can I find the "placeholder" word and insert my docx there?
public static void merge(InputStream src1, InputStream src2, OutputStream dest) throws Exception {
OPCPackage src1Package = OPCPackage.open(src1);
OPCPackage src2Package = OPCPackage.open(src2);
XWPFDocument src1Document = new XWPFDocument(src1Package);
CTBody src1Body = src1Document.getDocument().getBody();
XWPFDocument src2Document = new XWPFDocument(src2Package);
CTBody src2Body = src2Document.getDocument().getBody();
appendBody(src1Body, src2Body);
src1Document.write(dest);
}
private static void appendBody(CTBody src, CTBody append) throws Exception {
XmlOptions optionsOuter = new XmlOptions();
optionsOuter.setSaveOuter();
String appendString = append.xmlText(optionsOuter);
String srcString = src.xmlText();
String prefix = srcString.substring(0, srcString.indexOf(">") + 1);
String mainPart = srcString.substring(srcString.indexOf(">") + 1, srcString.lastIndexOf("<"));
String suffix = srcString.substring(srcString.lastIndexOf("<"));
String addPart = appendString.substring(appendString.indexOf(">") + 1, appendString.lastIndexOf("<"));
CTBody makeBody = CTBody.Factory.parse(prefix + mainPart + addPart + suffix);
src.set(makeBody);
}
Re docx4j you can insert a docx at a specific location (eg in a table cell) using MergeDocx in our commercial Docx4j Enterprise.
You can get a trial version from https://www.plutext.com/m/index.php/products
Then see the MergeIntoTableCell sample and documentation.
Another solution: in my example, we can find the placeholder text in mainPart (indexOf / lastIndexOf / substring work better here than a regex), replace it with addPart, and we're ready to go.
Two possible problems:
1: if addPart contains numbered or bulleted lists, they can get messy after being added to the other document.
2: inserting pictures is not possible this way; they have to be handled differently.
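A minimal sketch of that string-based approach, using only the JDK. It assumes that mainPart and addPart are the XML strings built in appendBody, that the placeholder text sits inside a single run of one top-level paragraph (Word sometimes splits a word across several runs, in which case indexOf will not find it), and that the whole paragraph holding the placeholder should be replaced. The class and method names are hypothetical.

```java
// Hypothetical helper operating on the body XML strings from appendBody.
final class PlaceholderSplice {

    // Replaces the <w:p> paragraph that contains the placeholder with addPart.
    // Returns mainPart unchanged if the placeholder or its paragraph is not found.
    static String replaceParagraph(String mainPart, String placeholder, String addPart) {
        int hit = mainPart.indexOf(placeholder);
        if (hit < 0) {
            return mainPart;
        }
        // Search for "<w:p>" or "<w:p " so that "<w:pPr>" is not mistaken for a paragraph start.
        int start = Math.max(mainPart.lastIndexOf("<w:p>", hit),
                             mainPart.lastIndexOf("<w:p ", hit));
        int end = mainPart.indexOf("</w:p>", hit);
        if (start < 0 || end < 0) {
            return mainPart;
        }
        end += "</w:p>".length();
        return mainPart.substring(0, start) + addPart + mainPart.substring(end);
    }

    public static void main(String[] args) {
        String main = "<w:p><w:r><w:t>before</w:t></w:r></w:p>"
                + "<w:p><w:pPr/><w:r><w:t>placeholder</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>after</w:t></w:r></w:p>";
        String add = "<w:p><w:r><w:t>inserted</w:t></w:r></w:p>";
        System.out.println(replaceParagraph(main, "placeholder", add));
    }
}
```

Replacing the whole paragraph rather than just the word keeps the result well-formed, because addPart itself consists of complete paragraphs.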
I need to replace a word in an existing PDF AcroField with another word. I am using PdfStamper from iTextSharp to do this, and it works fine. However, it requires creating a new PDF, and I would like the change to be made in the existing PDF itself. If I set the destination filename to the same value as the original filename, no change is reflected. I am new to iTextSharp; is there anything I am doing wrong? Please help. Here is the piece of code I am using:
private void ListFieldNames(string s)
{
try
{
string pdfTemplate = @"z:\TEMP\PDF\PassportApplicationForm_Main_English_V1.0.pdf";
string newFile = @"z:\TEMP\PDF\PassportApplicationForm_Main_English_V1.0.pdf";
PdfReader pdfReader = new PdfReader(pdfTemplate);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
PdfReader reader = new PdfReader((string)pdfTemplate);
using (PdfStamper stamper = new PdfStamper(reader, new FileStream(newFile, FileMode.Create, FileAccess.ReadWrite)))
{
AcroFields form = stamper.AcroFields;
var fieldKeys = form.Fields.Keys;
foreach (string fieldKey in fieldKeys)
{
//Replace Address Form field with my custom data
if (fieldKey.Contains("Address"))
{
form.SetField(fieldKey, s);
}
}
stamper.FormFlattening = true;
stamper.Close();
}
}
}
As documented in my book iText in Action, you can't read a file and write to it simultaneously. Think of how Word works: you can't open a Word document and write directly to it. Word always creates a temporary file, writes the changes to it, then replaces the original file with it and then throws away the temporary file.
You can do that too:
read the original file with PdfReader,
create a temporary file for PdfStamper, and when you're done,
replace the original file with the temporary file.
Or:
read the original file into a byte[],
create PdfReader with this byte[], and
use the path to the original file for PdfStamper.
This second option is more dangerous, as you'll lose the original file if you do something that causes an exception in PdfStamper.
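The first option (temporary file, then replace) can be sketched with the JDK alone. The class SafeOverwrite and the Rewriter interface below are hypothetical names; in an iText program the rewriter would be the place where a PdfReader is opened on the source and a PdfStamper writes to the target, and both would have to be closed before the move. Java is used here for illustration; the same pattern applies to the C# code in the question.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper showing the "write to a temp file, then replace" pattern.
final class SafeOverwrite {

    // Reads from source and writes the modified result to target.
    interface Rewriter {
        void rewrite(Path source, Path target) throws IOException;
    }

    static void overwrite(Path original, Rewriter rewriter) throws IOException {
        // Create the temp file next to the original so the move stays on one file system.
        Path dir = original.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "rewrite", ".tmp");
        try {
            rewriter.rewrite(original, temp);
            // Only reached if rewriting succeeded, so a failure leaves the original intact.
            Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up the temp file after a failure.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, "old".getBytes());
        overwrite(file, (src, dst) -> Files.write(dst, "new".getBytes()));
        System.out.println(new String(Files.readAllBytes(file))); // prints "new"
        Files.delete(file);
    }
}
```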
I am using iText for extraction of data from PDFs. My application is able to read PDFs with English characters, but we found a new file with Chinese characters. When I tried to extract that data, I get an error:
ExceptionConverter: com.itextpdf.text.DocumentException: Font 'STSong-Light' with 'UniGB-UCS2-H' is not recognized.
So I added itext-asian.jar. Now I am not getting an error, but getTextFromPage() returns an empty string. Am I missing something?
PdfReader pr = new PdfReader(inputPdf);
// get the number of pages in the document
PdfTextExtractor pte =
new PdfTextExtractor(pr, new CustomLocationAwarePdfRenderListener(scanDepth));
int pNum = pr.getNumberOfPages();
String text = "";
// extract text from each page and write it to the output text file
for (int page = 1; page <= pNum; page++) {
text = text.concat("\n").concat(pte.getTextFromPage(page));
}
Possible Duplicate:
Using PDFBox to write UTF-8 encoded strings to a PDF
I need to create a PDF with Czech national characters, and I'm trying to do it with the PDFBox library.
I have copied the following code from some tutorials:
public void doIt(String file, String message) throws IOException, COSVisitorException
{
PDDocument doc = null;
try
{
doc = new PDDocument();
PDSimpleFont font = PDType1Font.TIMES_ROMAN;
TextToPDF textToPdf = new TextToPDF();
textToPdf.setFont(font);
textToPdf.setFontSize(12);
doc = textToPdf.createPDFFromText(new StringReader(message));
doc.save(file);
}
finally
{
if( doc != null )
{
doc.close();
}
}
}
Now I am calling the function doIt:
app.doIt("test.pdf", "Skákal pes přes oves, přes zelenou louku.");
This runs without errors, but in the output PDF I get: "þÿSkákal pes pYes oves, pYes zelenou louku."
I tried to find out how to set UTF-8 encoding in PDFBox, but IMHO there is just no solution for this on the internet.
Do you have any ideas how to get the right text in the output PDF?
Thank you.
I think it's the PDType1Font.TIMES_ROMAN font that does not support your Czech national characters. If you can get a .ttf file for a font that contains the Czech national characters, load it as a PDFont as shown below and use that instead:
PDFont font = PDTrueTypeFont.loadTTF( doc, new File( "CheckRepFont.ttf" ) );
Here CheckRepFont.ttf is just an example font file name. Replace it with the actual one.
EDIT:
PDStream pdStream = new PDStream(doc);
PDSimpleFont font = PDType1Font.TIMES_ROMAN;
font.setToUnicode(pdStream);
I need to read a Java File (actually a .pdf) into a String and then write it back to a file. Between those steps I'll apply some patches to the string, but that is not important here.
I've developed the following JUnit test case:
String f1String=FileUtils.readFileToString(f1);
File temp=File.createTempFile("deleteme", "deleteme");
FileUtils.writeStringToFile(temp, f1String);
assertTrue(FileUtils.contentEquals(f1, temp));
This test converts a file to a string and writes it back. However, the test is failing.
I think it may be because of the encodings, but FileUtils doesn't give much detail about this.
Can anyone help?
Thanks!
Added for further understanding:
Why I need this?
I have very large PDFs on one machine that are replicated on another one. The first machine is in charge of creating those PDFs. Due to the low connectivity of the second machine and the large size of the PDFs, I don't want to sync the whole PDFs, only the changes.
To create and apply patches, I'm using the Google library DiffMatchPatch. This library creates patches between two strings. So I need to load a PDF into a string, apply a generated patch, and write it back to a file.
A PDF is not a text file. Decoding (into Java characters) and re-encoding of binary files that are not encoded text is asymmetrical. For example, if the input bytestream is invalid for the current encoding, you can be assured that it won't re-encode correctly. In short - don't do that. Use readFileToByteArray and writeByteArrayToFile instead.
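The asymmetry is easy to reproduce with the JDK alone. In the sketch below (class and method names are hypothetical), bytes that are not valid UTF-8 are replaced by U+FFFD during decoding, so re-encoding cannot restore them. The sample bytes are "%PDF" followed by four high bytes, chosen only as an example of binary data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it only shows why decode + re-encode is lossy.
final class LossyRoundTrip {

    // True if decoding as UTF-8 and encoding again reproduces the input bytes.
    static boolean survivesUtf8(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(data, reencoded);
    }

    public static void main(String[] args) {
        byte[] ascii = {0x25, 0x50, 0x44, 0x46}; // "%PDF"
        byte[] binary = {0x25, 0x50, 0x44, 0x46,
                (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3}; // not valid UTF-8
        System.out.println(survivesUtf8(ascii));  // prints "true"
        System.out.println(survivesUtf8(binary)); // prints "false"
    }
}
```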
Just a few thoughts:
There might actually be some BOM (byte order mark) bytes in one of the files that either get stripped when reading or added when writing. Is there a difference in file size (if it is the BOM, the difference should be 2 or 3 bytes)?
The line breaks might not match, depending which system the files are created on, i.e. one might have CR LF while the other only has LF or CR. (1 byte difference per line break)
According to the JavaDoc both methods should use the default encoding of the JVM, which should be the same for both operations. However, try and test with an explicitly set encoding (JVM's default encoding would be queried using System.getProperty("file.encoding")).
Ed Staub's answer points out why my solution was not working, and he suggested using bytes instead of Strings. In my case I need a String, so the final working solution I found is the following:
@Test
public void testFileRWAsArray() throws IOException{
String f1String="";
byte[] bytes=FileUtils.readFileToByteArray(f1);
for(byte b:bytes){
f1String=f1String+((char)b);
}
File temp=File.createTempFile("deleteme", "deleteme");
byte[] newBytes=new byte[f1String.length()];
for(int i=0; i<f1String.length(); ++i){
char c=f1String.charAt(i);
newBytes[i]= (byte)c;
}
FileUtils.writeByteArrayToFile(temp, newBytes);
assertTrue(FileUtils.contentEquals(f1, temp));
}
By casting between byte and char in both directions, the conversion becomes symmetrical.
Thank you all!
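As a possible alternative to the per-byte loops, ISO-8859-1 maps every byte value 0..255 to exactly one char, so decoding and encoding with it gives a similar byte-for-byte symmetry without building the string one character at a time. This is a sketch with hypothetical names, not the poster's code. Note the assumption in the comments: any char above U+00FF introduced into the string (for example by a patch) would be written as '?' and break the round trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper; a one-byte-per-char bridge between byte[] and String.
final class Latin1Bridge {

    // Each byte becomes the char with the same unsigned value (0x00..0xFF).
    static String bytesToString(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    // Assumes every char is <= U+00FF; anything higher is encoded as '?'.
    static byte[] stringToBytes(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        System.out.println(Arrays.equals(all, stringToBytes(bytesToString(all)))); // prints "true"
    }
}
```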
Try this code...
public static String fetchBase64binaryEncodedString(String path) {
File inboundDoc = new File(path);
byte[] pdfData;
try {
pdfData = FileUtils.readFileToByteArray(inboundDoc);
} catch (IOException e) {
throw new RuntimeException(e);
}
byte[] encodedPdfData = Base64.encodeBase64(pdfData);
String attachment = new String(encodedPdfData);
return attachment;
}
//How to decode it
public void testConversionPDFtoBase64() throws IOException
{
String path = "C:/Documents and Settings/kantab/Desktop/GTR_SDR/MSDOC.pdf";
File origFile = new File(path);
String encodedString = CreditOneMLParserUtil.fetchBase64binaryEncodedString(path);
//now decode it
byte[] decodeData = Base64.decodeBase64(encodedString.getBytes());
String decodedString = new String(decodeData);
//or actually give the path to pdf file.
File decodedfile = File.createTempFile("DECODED", ".pdf");
FileUtils.writeByteArrayToFile(decodedfile,decodeData);
Assert.assertTrue(FileUtils.contentEquals(origFile, decodedfile));
// Frame frame = new Frame("PDF Viewer");
// frame.setLayout(new BorderLayout());
}
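For readers on Java 8 or later, the same Base64 round trip can be sketched with java.util.Base64 from the JDK instead of the Commons Codec Base64 class used above. The class name Base64RoundTrip is hypothetical, and the four sample bytes stand in for real PDF content.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper using the JDK's Base64 codec.
final class Base64RoundTrip {

    // Binary data to an ASCII-safe string.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Back to the original bytes.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] header = "%PDF".getBytes(StandardCharsets.US_ASCII);
        String encoded = encode(header);
        System.out.println(encoded); // prints "JVBERg=="
        System.out.println(Arrays.equals(header, decode(encoded))); // prints "true"
    }
}
```

Base64 text is roughly a third larger than the binary input, which may matter for the large files described earlier in the thread.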