I'm building a PDF parser using Apache PDFBox. After parsing the plain text I run some algorithms and finally output a JSON file. For some PDFs the output file contains UTF-8 encoding; for others it contains what seems to be Latin-1 (spaces show up as "\xa0" when the JSON file is opened in Python). I assume this must be a consequence of the fonts or some other characteristic of the PDF?
My code to read the plain text is as follows:
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
//Closing the document
document.close();
I've tried just saving the plain text:
PrintWriter out = new PrintWriter(outPath + ".txt");
out.print(text);
Even opening this plain-text file in Python yields "\xa0" characters instead of spaces when the file is read into a dictionary, yielding the following result:
dict_keys(['1.\xa0lorem\xa0ipsum', '2.\xa0lorem\xa0ipsum\xa0lorem\xa0ipsum', '3.\xa0lorem', '4.\xa0lorem\xa0ipsum', '5.\xa0lorem\xa0ipsum'])
I'd like to make sure the text always gets encoded as utf-8. How do I go about doing this?
If you want to make sure your PrintWriter uses UTF-8 encoding, say so in the constructor:
PrintWriter out = new PrintWriter(outPath + ".txt", "UTF-8");
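Note that the "\xa0" Python shows is not necessarily an encoding bug at all: it is U+00A0 (no-break space), a real character that some PDF fonts map their space glyphs to, and it survives a correct UTF-8 round trip. A minimal sketch (file name and sample string are placeholders) that writes with an explicit UTF-8 charset and, as a separate optional step, normalizes no-break spaces:

```java
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class Utf8Output {
    // U+00A0 is a legitimate Unicode character that some PDFs use
    // instead of a plain space; replacing it is independent of
    // choosing the output encoding.
    static String normalizeSpaces(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) throws Exception {
        String text = "1.\u00A0lorem\u00A0ipsum"; // stand-in for stripper output
        try (PrintWriter out = new PrintWriter("out.txt", StandardCharsets.UTF_8.name())) {
            out.print(normalizeSpaces(text));
        }
    }
}
```

With the charset fixed, any remaining "\xa0" in the JSON is the character PDFBox actually extracted, not an artifact of the writer.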
Related
PDF2Dom (based on the PDFBox library) is capable of converting PDFs to HTML format, preserving characteristics such as font size, boldness, etc. An example of this conversion is shown below:
private void generateHTMLFromPDF(String filename) throws IOException {
    PDDocument pdf = PDDocument.load(new File(filename));
    Writer output = new PrintWriter("src/output/pdf.html", "utf-8");
    new PDFDomTree().writeText(pdf, output);
    output.close();
}
I'm trying to parse an existing PDF and extract these characteristics on a line-by-line basis, and I wonder if there are any existing methods within PDF2Dom/PDFBox to parse these right from the PDF?
Another method would be to just use the HTML output and proceed from there but it seems like an unnecessary detour.
I have a field in a table that contains the string "Address Pippo p.2 °".
My program reads this value and writes it into a txt file, but the output is:
"Address Pippo p.2 Â°" (the Â is unwanted)
This is a problem because the txt file is a positional (fixed-width) file.
I open the file with these Java instructions:
FileWriter fw = new FileWriter(file, true);
PrintWriter pw = new PrintWriter(fw);
I want to write the string without the strange characters.
Any help for me?
Thanks in advance
Try writing the string through a writer with an explicit UTF-8 charset. A plain FileWriter uses the platform default encoding, and encoding a String to bytes and decoding it back in memory is a no-op, so the charset has to be specified on the writer itself:
File file = new File("D://test.txt");
PrintWriter pw = new PrintWriter(
        new OutputStreamWriter(new FileOutputStream(file, true), StandardCharsets.UTF_8));
String test = "Address Pippo p.2 °";
pw.write(test);
pw.close();
Java uses Unicode. When you write text to a file, it gets encoded using a particular character encoding. If you don't specify it explicitly, it will use a "system default encoding" which is whatever is configured as default for your particular JVM instance. You need to know what encoding you've used to write the file. Then you need to use the same encoding to read and display the file content. The funny characters you are seeing are probably due to writing the file using UTF-8 and then trying to read and display it in e.g. Notepad using Windows-1252 ("ANSI") encoding.
Decide what encoding you want and stick to it for both reading and writing. To write using Windows-1252, use:
Writer w = new OutputStreamWriter(new FileOutputStream(file, true), "windows-1252");
And if you write in UTF-8, then tell Notepad that you want it to read the file in UTF-8. One way to do that is to write the character '\uFEFF' (Byte Order Mark) at the beginning of the file.
If you use UTF-8, be aware that non-ASCII characters will throw the subsequent bytes out of position. So if, for example, a telephone field must always start at byte position 200, then having a non-ASCII character in an address field before it will make the telephone field start at byte position 201 or 202. Using windows-1252 encoding you won't have this issue, but that encoding can't encode all Unicode characters.
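To see the byte-position effect concretely, here is a small sketch (using the string from the question) comparing the encoded lengths under the two charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedWidth {
    public static void main(String[] args) {
        String s = "Address Pippo p.2 °";
        // ° (U+00B0) takes 2 bytes in UTF-8 but only 1 in windows-1252,
        // so each such character shifts every later fixed position by one byte.
        System.out.println("chars:        " + s.length());
        System.out.println("UTF-8 bytes:  " + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("cp1252 bytes: " + s.getBytes(Charset.forName("windows-1252")).length);
    }
}
```

For a positional file, it is the byte count, not the character count, that determines field offsets, which is why the single-byte encoding is safer here.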
à¤à¥ विनियमन à¤à¤§à¤¿à¤¨à¤¿à¤¯à¤®
This is the output I obtain when I translate an English sentence. Is there any way to make it readable?
The goal is to translate an English sentence to Hindi. The Hindi translation is displayed correctly in the console; I need to write it to a text file.
The translated sentence is set to "translation" and retrieved with getParameter() before being saved to the file.
String translation = request.getParameter("translation");
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(fileDir, true), "UTF-8");
BufferedWriter fbw = new BufferedWriter(writer);
fbw.write(translation);
fbw.close();
This is an issue with mismatched character encodings.
Make sure the data returned from the request parameter is decoded as UTF-8.
If the data is in a different encoding, you will have to use that same encoding when writing to the file.
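As an illustration of the mechanism (not necessarily the fix for this particular servlet container): if UTF-8 bytes are decoded as ISO-8859-1 somewhere along the way, you get exactly this kind of Devanagari mojibake, and because Latin-1 maps every byte to a character, the damage is reversible as long as no byte was dropped:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "विनियमन"; // Hindi word from the question's output
        // Simulate a parameter whose UTF-8 bytes were decoded as Latin-1:
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        // Re-encode as Latin-1 and decode as UTF-8 to recover the text:
        String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(garbled);
        System.out.println(repaired.equals(original)); // true
    }
}
```

The proper fix is to make the request decode parameters as UTF-8 in the first place rather than repairing strings after the fact.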
I have RTF files encoded as ANSI that contain Arabic phrases. I'm trying to read such a file but can't get the encoding right.
RTF File:
{\rtf1\fbidis\ansi\deff0{\fonttbl{\f0\fnil\fcharset178 MS Sans Serif;}{\f1\fnil\fcharset0 MS Sans Serif;}}
\viewkind4\uc1\pard\ltrpar\lang12289\f0\rtlch\fs16\'ca\'d1\'cc\'e3\'c9: \'d3\'e3\'ed\'d1 \'c7\'e1\'e3\'cc\'d0\'e6\'c8\f1\ltrch\par
}
and my java code is:
RTFEditorKit rtf = new RTFEditorKit();
Document doc = rtf.createDefaultDocument();
rtf.read(new InputStreamReader(new FileInputStream("Document.rtf"), "windows-1256"),doc,0);
System.out.println(doc.getText(0,doc.getLength()));
and the wrong output is:
ÊÑÌãÉ: ÓãíÑ ÇáãÌÐæÈ
Try RTFParserKit, this should correctly support encodings like the ones you describe.
Here is the text it extracted from your example:
ترجمة: سمير المجذوب
I used the RtfDump class which ships with RTFParserKit to dump the RTF content into an XML file. The class invokes the StandardRtfParser on the supplied input file, while the RtfDumpListener class receives the events raised by the parser as the file is read, adding content to the XML file as it goes.
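For reference, the \'xx sequences in the sample are raw bytes in the codepage selected by \fcharset178 (Arabic, i.e. windows-1256). Below is a minimal sketch that decodes just those escapes for a single known codepage; it is not a full RTF parser and no substitute for RTFParserKit, which tracks font and charset switches properly:

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RtfEscapeDecoder {
    // Decode runs of \'xx escapes with the given single-byte charset,
    // leaving all other text untouched. Real RTF can switch charsets
    // mid-document; this assumes one codepage for every escape.
    static String decodeEscapes(String rtf, Charset charset) {
        Matcher m = Pattern.compile("(\\\\'[0-9a-fA-F]{2})+").matcher(rtf);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            byte[] bytes = new byte[run.length() / 4]; // each escape is 4 chars: \'xx
            for (int i = 0; i < bytes.length; i++) {
                bytes[i] = (byte) Integer.parseInt(run.substring(i * 4 + 2, i * 4 + 4), 16);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(new String(bytes, charset)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "\\'ca\\'d1\\'cc\\'e3\\'c9: \\'d3\\'e3\\'ed\\'d1";
        System.out.println(decodeEscapes(sample, Charset.forName("windows-1256")));
    }
}
```

Running this on the escapes from the question's RTF yields the Arabic text shown above, which confirms the document's bytes were windows-1256 all along; it was Swing's RTFEditorKit that ignored the \fcharset directive.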
I am trying to get text from a properties file encoded in UTF-8 and write it into a PDF file using a Document object in Java (iText).
Document document = new Document();
File file = new File(FILES_PATH + ".pdf");
FileOutputStream fos = new FileOutputStream(file);
PdfWriter.getInstance(document, fos);
.
.
.
PdfPTable table;
document.add(table);
document.close();
When I just take the value from the property, the Chinese characters are ignored.
When I try to re-encode the string, I get strange characters or "?" instead.
I have tried encoding it as UTF-8, ISO-8859-1, GBK and GB2312.
I need the PDF file to be able to display Chinese characters.
It will not work that way.
To display a Unicode character that is not in the built-in PDF fonts, you need to specify a custom font for the text and create a separate fragment for each run covered by a given font. You also need to embed the fonts you use in the PDF document (so consider whether their licenses allow distributing them).
So each String may be rendered using several fonts. iText has a class, FontSelector, that does exactly that:
FontSelector selector = new FontSelector();
BaseFont bf1 = BaseFont.createFont(fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
bf1.setSubset(true);
Font font1 = new Font(bf1, 12, Font.BOLD);
selector.addFont(font1);
// ... do that with all fonts you need
Phrase ph = selector.process(TEXT);
document.add(new Paragraph(ph));
You can find a more complex example in my article: Using dynamic fonts for international texts in iText.