I need to create a PDF containing Czech national characters, and I'm trying to do it with the PDFBox library.
I copied the following code from some tutorials:
public void doIt(String file, String message) throws IOException, COSVisitorException
{
    PDDocument doc = null;
    try
    {
        doc = new PDDocument();
        PDSimpleFont font = PDType1Font.TIMES_ROMAN;
        TextToPDF textToPdf = new TextToPDF();
        textToPdf.setFont(font);
        textToPdf.setFontSize(12);
        doc = textToPdf.createPDFFromText(new StringReader(message));
        doc.save(file);
    }
    finally
    {
        if( doc != null )
        {
            doc.close();
        }
    }
}
Now I'm calling the function doIt:
app.doIt("test.pdf", "Skákal pes přes oves, přes zelenou louku.");
This runs without errors, but in the output PDF I get: "þÿSkákal pes pYes oves, pYes zelenou louku."
I tried to find out how to set UTF-8 encoding in PDFBox, but I could not find a solution anywhere online.
Do you have any ideas how to get the correct text in the output PDF?
Thank you.
I think it's the PDType1Font.TIMES_ROMAN font that does not support your Czech national characters. If you can get a .ttf file that covers the Czech national characters, load it as a PDFont as shown below and use that instead:
PDFont font = PDTrueTypeFont.loadTTF( doc, new File( "CheckRepFont.ttf" ) );
Here CheckRepFont.ttf is only an example font file name; replace it with your actual one.
EDIT:
PDStream pdStream = new PDStream(doc);
PDSimpleFont font = PDType1Font.TIMES_ROMAN;
font.setToUnicode(pdStream);
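The garbled output in the question is consistent with the text having been stored as UTF-16BE with a byte order mark and then displayed one byte at a time. The sketch below uses only the JDK, not PDFBox, and the class name MojibakeDemo is made up for illustration; it is a guess at the mechanism, not a description of PDFBox internals.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-16 (Java emits a big-endian BOM first), then
    // read the bytes back as if every byte were one ISO-8859-1 character.
    static String misread(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        String perByte = new String(utf16, StandardCharsets.ISO_8859_1);
        // Drop control characters (mostly the 0x00 high bytes), which a
        // viewer would normally not display.
        return perByte.replaceAll("\\p{Cntrl}", "");
    }

    public static void main(String[] args) {
        // The escapes spell a-acute (U+00E1) and r-caron (U+0159).
        System.out.println(misread("Sk\u00e1kal pes p\u0159es oves, p\u0159es zelenou louku."));
    }
}
```

Run on the sentence from the question, this reproduces both the þÿ prefix and the Y in place of ř, which suggests that the font and its encoding, rather than the input string, are what need to change.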
I have a function that places my text into the document in a table-like layout.
private static void addCenteredParagraph(Document document, float width, String text) {
    PdfFont timesNewRomanBold = null;
    try {
        timesNewRomanBold = PdfFontFactory.createFont(StandardFonts.TIMES_BOLD);
    } catch (IOException e) {
        LOGGER.error("Failed to create Times New Roman Bold font.");
        LOGGER.error(e);
    }
    List<TabStop> tabStops = new ArrayList<>();
    // Create a TabStop at the middle of the page
    tabStops.add(new TabStop(width / 2, TabAlignment.CENTER));
    // Create a TabStop at the end of the page
    tabStops.add(new TabStop(width, TabAlignment.LEFT));
    Paragraph p = new Paragraph().addTabStops(tabStops).setFontSize(14);
    if (timesNewRomanBold != null) {
        p.setFont(timesNewRomanBold);
    }
    p.add(new Tab()).add(text).add(new Tab());
    document.add(p);
}
My problem is that the exported PDF shows empty characters instead of the letters ő, Ő, ű, Ű.
Times New Roman supports these characters, so I think I need to set the encoding to UTF-8, but I couldn't find a way to do that.
Can someone please explain how to make these characters appear properly in the PDF?
I tried the approaches from the questions below, but they either use deprecated functions, don't accept the arguments I'm passing, or rely on Chunk, which I don't use:
Itext PDF writer, Is there any way to allow unicode subscript symbol in the pdf? (Without setTextRise)
How can I set encoding for iText when I insert value to placeholder in pdf form?
Edit: I figured out that timesNewRomanBold is null, even if I set it to HELVETICA or COURIER.
I am trying to convert files with the .docx extension to PDF using Java. I need to convert files that contain shapes and drawings created in MS Word. Which libraries (open source or licensed) will serve the purpose?
Currently I am using "org.apache.poi.xwpf.converter.pdf.PdfConverter" for this, but it skips the shapes and drawings in my Word document. I am unable to test with Aspose.Words. Any help with that would also be appreciated.
The method I use for conversion is:
public static void createPDFFromIMG(String sSourceFilePath, String sFileName, String sDestinationFilePath) throws Exception {
    logger.debug("Entered into createPDFFromIMG()\n");
    logger.info("### Started PDF Conversion..");
    System.out.println("### Started PDF Conversion..");
    if (sFileName.contains(".docx")) {
        File target = new File(sDestinationFilePath + "/" + sFileName.split("\\.")[0] + ".pdf");
        // try-with-resources closes both streams even if the conversion throws
        try (InputStream doc = new FileInputStream(new File(sSourceFilePath));
             OutputStream out = new FileOutputStream(target)) {
            XWPFDocument document = new XWPFDocument(doc);
            PdfOptions options = PdfOptions.create();
            PdfConverter.getInstance().convert(document, out, options);
        }
        System.out.println("### Completed PDF Conversion..");
        logger.info("### Completed PDF Conversion..");
        logger.debug("Exited from createPDFFromIMG()");
    }
}
I expect the complete Word file to be converted to PDF, but the file converted using the mentioned Java library does not contain drawings or shapes present in the docx file.
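A side note on the method above, unrelated to the missing shapes: contains(".docx") also accepts names such as notes.docx.bak, and split("\\.")[0] keeps only the part before the first dot. A minimal sketch of a stricter name mapping, using only the JDK; PdfNames and toPdfName are hypothetical names, not part of any converter library:

```java
final class PdfNames {
    private static final String DOCX = ".docx";

    // Replace a trailing ".docx" (any letter case) with ".pdf".
    // Anything else is rejected instead of being silently converted.
    static String toPdfName(String fileName) {
        int start = fileName.length() - DOCX.length();
        // regionMatches returns false when start is negative (name too short)
        if (!fileName.regionMatches(true, start, DOCX, 0, DOCX.length())) {
            throw new IllegalArgumentException("Not a .docx file: " + fileName);
        }
        return fileName.substring(0, start) + ".pdf";
    }

    public static void main(String[] args) {
        System.out.println(toPdfName("report.v2.docx"));
    }
}
```

With this mapping a name like report.v2.docx keeps its inner dot, whereas the split-based version would shorten it to report.pdf.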
It is not clear why you are unable to test with Aspose.Words. The code is quite simple:
Document doc = new Document("in.docx");
doc.save("out.pdf");
You can also test with the free Aspose app (which is based on Aspose.Words):
https://products.aspose.app/words/conversion
I am facing a problem when invoking the setValue method of a PDField with a value that contains special characters.
field.setValue("TEST-BY (TEST)")
In detail, if my value contains a character such as U+00A0, I get the following exception:
Caused by: java.lang.IllegalArgumentException: U+00A0 is not
available in this font's encoding: WinAnsiEncoding
A complete stack trace can be found here: Stacktrace
I currently have PDType1Font.TIMES_ROMAN set as the font. To solve this problem I also tried the other available fonts, but the same problem persisted.
I found a suggestion in this answer https://stackoverflow.com/a/22274334/7434590, but it relies on showText/drawText, which can work with bytes. We use setValue, which accepts only a String parameter, so I could not use that approach.
Note: I cannot replace the characters with others to work around this issue; I must be able to pass any character supported by the font to the setValue method.
You'll have to embed a font and not use WinAnsiEncoding:
PDFont formFont = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/somefont.ttf"), false); // check that the font has what you need; ARIALUNI.TTF is good but huge
PDResources res = acroForm.getDefaultResources(); // could be null, if so, then create it with the setter
String fontName = res.add(formFont).getName();
String defaultAppearanceString = "/" + fontName + " 0 Tf 0 g"; // adjust to replace existing font name
textField.setDefaultAppearance(defaultAppearanceString);
Note that this code must be run before calling setValue().
More about this in the CreateSimpleFormWithEmbeddedFont.java example from the source code download.
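Before picking a replacement font, it can help to see exactly which characters in a field value fall outside printable ASCII, so they can be checked against the font's glyph coverage. The helper below is a plain-JDK sketch; CodePointReport and suspectCodePoints are hypothetical names and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointReport {
    // List every code point outside printable ASCII (U+0020..U+007E)
    // in "U+XXXX" notation, in order of appearance.
    static List<String> suspectCodePoints(String value) {
        List<String> found = new ArrayList<>();
        value.codePoints()
             .filter(cp -> cp < 0x20 || cp > 0x7E)
             .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // The escape below is a no-break space, as in the exception message.
        System.out.println(suspectCodePoints("TEST-BY\u00a0(TEST)"));
    }
}
```

An empty list means the value is plain ASCII; anything reported is a candidate for the "not available in this font's encoding" error and should be covered by the embedded font.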
Avoid using WinAnsiEncoding (problems with encoding)
PDDocument document = new PDDocument();
//Fonts
InputStream fontInputStreamAvenirMedium = new URL(Constants.S3 + "/Fonts/Avenir-Medium.ttf").openStream();
InputStream fontInputStreamAvenirBlack = new URL(Constants.S3 + "/Fonts/Avenir-Black.ttf").openStream();
InputStream fontInputStreamDINCondensedBold = new URL(Constants.S3 + "/Fonts/DINCondensedBold.ttf").openStream();
PDFont font = PDType0Font.load(document, fontInputStreamAvenirMedium);
PDFont fontBold = PDType0Font.load(document, fontInputStreamAvenirBlack);
PDFont fontDIN = PDType0Font.load(document, fontInputStreamDINCondensedBold);
//PDFont font = PDTrueTypeFont.load(document, fontInputStreamAvenirMedium, WinAnsiEncoding.INSTANCE); /* encoding problems */
//PDFont fontBold = PDTrueTypeFont.load(document, fontInputStreamAvenirBlack, WinAnsiEncoding.INSTANCE); /* encoding problems */
//PDFont fontDIN = PDTrueTypeFont.load(document, fontInputStreamDINCondensedBold, WinAnsiEncoding.INSTANCE); /* encoding problems */
See also: https://pdfbox.apache.org/2.0/faq.html#fontencoding
I am using iText for extraction of data from PDFs. My application is able to read PDFs with English characters, but we found a new file with Chinese characters. When I tried to extract that data, I get an error:
ExceptionConverter: com.itextpdf.text.DocumentException: Font 'STSong-Light' with 'UniGB-UCS2-H' is not recognized.
So I added itext-asian.jar. Now I am not getting an error, but getTextFromPage() returns an empty string. Am I missing something?
PdfReader pr = new PdfReader(inputPdf);
// get the number of pages in the document
PdfTextExtractor pte =
    new PdfTextExtractor(pr, new CustomLocationAwarePdfRenderListener(scanDepth));
int pNum = pr.getNumberOfPages();
String text = "";
// extract text from each page and write it to the output text file
for (int page = 1; page <= pNum; page++) {
    text = text.concat("\n").concat(pte.getTextFromPage(page));
}
I am reading PDF documents with the iTextSharp library.
These documents are in Czech, which uses diacritics (ř ě ž š č etc.).
How can I read these characters? Any ideas? Or is there a way to replace them with plain r e z s c?
This is the code in my method. Thanks.
PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);
// we can inspect the syntax of the imported page
String text = new String();
for (int page = 1; page <= 1; page++) {
    text += PdfTextExtractor.getTextFromPage(reader, page);
}
reader.close();
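On the second part of the question (replacing ř ě ž š č with plain r e z s c): once the text has been extracted as a proper Unicode string, one common approach is to decompose it and drop the combining marks. This is a sketch using the JDK's java.text.Normalizer, independent of iText; the class name Diacritics is made up for illustration.

```java
import java.text.Normalizer;

final class Diacritics {
    // Decompose each accented letter into base letter + combining mark (NFD),
    // then remove the combining marks.
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The escapes spell the five letters from the question.
        System.out.println(strip("\u0159\u011b\u017e\u0161\u010d"));
    }
}
```

Letters that have no canonical decomposition, such as ł, are left unchanged by this approach, so it only helps when the extracted string already contains the correct characters.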
I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF
The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF to a file encoded using UTF-8 (UNICODE):
public void parse(String filename) throws IOException {
    PdfReader reader = new PdfReader(filename);
    // try-with-resources closes the output file even if extraction fails
    try (FileOutputStream fos = new FileOutputStream(DEST)) {
        for (int page = 1; page <= 1; page++) {
            fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
        }
    } finally {
        reader.close();
    }
}
The result is the file czech.txt:
As you can see from the screen shot, the text is extracted correctly (but make sure that the viewer you use knows that the file is encoded as UTF-8, otherwise you may see strange characters instead of the actual text).
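To illustrate the point about the viewer's encoding, the following plain-JDK sketch stores a string as UTF-8 bytes and decodes it with whatever charset a viewer might assume. Utf8ViewerDemo and view are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8ViewerDemo {
    // Store the text as UTF-8 bytes, then decode them with the charset
    // that a (possibly misconfigured) viewer assumes.
    static String view(String text, Charset viewerCharset) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, viewerCharset);
    }

    public static void main(String[] args) {
        String word = "p\u0159es"; // r-caron is U+0159
        System.out.println(view(word, StandardCharsets.UTF_8).equals(word));      // viewer agrees with the file
        System.out.println(view(word, StandardCharsets.ISO_8859_1).equals(word)); // viewer assumes Latin-1
    }
}
```

Decoding with UTF-8 returns the original text, while decoding the same bytes as ISO-8859-1 turns the single letter ř into two unrelated characters, which is the kind of "strange characters" effect described above.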
Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE
Please share your PDF so that people on StackOverflow can check whether text extraction fails because of an error in your code or because the PDF itself doesn't allow the text to be extracted.