Can't read RTF ANSI file that contains Arabic characters - java

I have RTF files encoded in ANSI that contain Arabic phrases. I'm trying to read such a file but can't get the encoding right.
RTF File:
{\rtf1\fbidis\ansi\deff0{\fonttbl{\f0\fnil\fcharset178 MS Sans Serif;}{\f1\fnil\fcharset0 MS Sans Serif;}}
\viewkind4\uc1\pard\ltrpar\lang12289\f0\rtlch\fs16\'ca\'d1\'cc\'e3\'c9: \'d3\'e3\'ed\'d1 \'c7\'e1\'e3\'cc\'d0\'e6\'c8\f1\ltrch\par
}
and my Java code is:
RTFEditorKit rtf = new RTFEditorKit();
Document doc = rtf.createDefaultDocument();
rtf.read(new InputStreamReader(new FileInputStream("Document.rtf"), "windows-1256"), doc, 0);
System.out.println(doc.getText(0, doc.getLength()));
and the wrong output is:
ÊÑÌãÉ: ÓãíÑ ÇáãÌÐæÈ

Try RTFParserKit; it should correctly support encodings like the one you describe.
Here is the text it extracted from your example:
ترجمة: سمير المجذوب
I used the RtfDump class which ships with RTFParserKit to dump the RTF content into an XML file. The class invokes the StandardRtfParser on the supplied input file, while the RtfDumpListener class receives the events raised by the parser as the file is read, adding content to the XML file as it goes.
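For plain text extraction you don't need the XML dump. Here is a minimal sketch using RTFParserKit's string converter (class names as in the RTFParserKit README; treat them as an assumption to check against the version you use):
import java.io.FileInputStream;
import java.io.InputStream;
import com.rtfparserkit.converter.text.StringTextConverter;
import com.rtfparserkit.parser.RtfStreamSource;

// Open the file as a raw byte stream: the parser honours the \ansi and
// \fcharset178 declarations itself and decodes the \'xx escapes with the
// matching code page, so no Reader or charset guess is needed here.
try (InputStream in = new FileInputStream("Document.rtf")) {
    StringTextConverter converter = new StringTextConverter();
    converter.convert(new RtfStreamSource(in));
    System.out.println(converter.getText());
}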

Related

How to set character encoding for PDFBox

I'm building a PDF parser using Apache PDFBox. After parsing the plain text I run some algorithms and in the end output a JSON file. For some PDF files the output contains UTF-8 encoding; for other PDFs it contains some form of what seems to be Latin-1 encoding (spaces show up as "\xa0" when the JSON file is opened in Python). I assume this must be a consequence of the fonts or some other characteristic of the PDFs?
My code to read the plain text is as follows:
PDDocument document = PDDocument.load(file);
//Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
//Retrieving text from PDF document
String text = pdfStripper.getText(document);
//Closing the document
document.close();
I've tried just saving the plain text:
PrintWriter out = new PrintWriter(outPath + ".txt");
out.print(text);
Even opening this plain text file in Python yields "\xa0" characters instead of spaces when the file is read into a dictionary, yielding the following result:
dict_keys(['1.\xa0lorem\xa0ipsum', '2.\xa0lorem\xa0ipsum\xa0lorem\xa0ipsum', '3.\xa0lorem', '4.\xa0lorem\xa0ipsum', '5.\xa0lorem\xa0ipsum'])
I'd like to make sure the text always gets encoded as utf-8. How do I go about doing this?
If you want to make sure your PrintWriter uses UTF-8 encoding, say so in the constructor:
PrintWriter out = new PrintWriter(outPath + ".txt", "UTF-8");
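On Java 7 and later, an equivalent that avoids charset-name strings (a sketch, not part of the original answer) is Files.newBufferedWriter:
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Writes the extracted text as UTF-8 regardless of the platform default charset.
try (Writer out = Files.newBufferedWriter(Paths.get(outPath + ".txt"), StandardCharsets.UTF_8)) {
    out.write(text);
}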

Reading and Writing Text files with UTF-16LE encoding and Apache Commons IO

I have written an application in Java and duplicated it in C#. The application reads and writes text files with tab-delimited data to be used by HMI software. The HMI software requires UTF or ANSI encoding for the degree symbol to be displayed correctly, or I would just use ASCII, which seems to work fine. The C# application can open files saved by either with no problem. The Java application reads files it saved perfectly, but a small problem crops up when reading files saved with C#: it throws a NumberFormatException when parsing the first character in the file to an int. This character is always a "1". I have opened both files in EditPad Lite and they appear to be identical, even when inspected with the encoding view, and the encoding is UTF-16LE. I'm racking my brain on this; any help would be appreciated.
List<String> lines = FileUtils.readLines(file, "UTF-16LE");
String[] line = lines.get(0).split("\\t");
Integer.parseInt(line[0]); // throws NumberFormatException on the C#-written file
I cannot see any difference between the file saved in C# and the one saved in Java.
[Screenshot of the data in EditPad Lite]
The workaround below drops the extra leading character when the first tab-delimited field is one character longer than expected:
// The stray character is the BOM; strip it so the parse succeeds
if (lines.get(0).split("\\t")[0].length() == 2) {
    lines.set(0, lines.get(0).substring(1));
}
Your .NET code is probably writing a BOM. Compliant Unicode readers strip any BOM, since it is metadata, not part of the text data.
Your Java code explicitly specifies the byte order:
FileUtils.readLines(file, "UTF-16LE");
It's somewhat of a Catch-22: if the source has a BOM, you can read it as "UTF-16". If it doesn't, you can read it as "UTF-16LE" or "UTF-16BE", provided you know which one it is.
So, either write the file with a BOM and read it without specifying the byte order, or write it without a BOM and read it specifying the byte order.
With a BOM:
[C#]
File.WriteAllLines(file, lines, Encoding.Unicode);
[Java]
FileUtils.readLines(file, "UTF-16");
Without a BOM:
[C#]
File.WriteAllLines(file, lines, new UnicodeEncoding(false));
[Java]
FileUtils.readLines(file, "UTF-16LE");
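If you cannot change either side, a third option (a sketch of my own, not part of the answer above) is to strip a leading BOM manually after opening the reader; a BOM decoded from UTF-16LE shows up as the character U+FEFF:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-16LE"));
reader.mark(1);                  // remember the start of the stream
if (reader.read() != '\uFEFF') { // not a BOM: the first char was real data
    reader.reset();              // rewind so it isn't lost
}
String firstLine = reader.readLine();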
In my Java code I read the file normally; I just specified the character encoding in the InputStreamReader:
File file = new File(fileName);
InputStreamReader fis = new InputStreamReader(new FileInputStream(file), "UTF-16LE");
BufferedReader br = new BufferedReader(fis);
String line = br.readLine();

Converting base64 encoded pdf to file input stream without writing file to system

I am getting a base64-encoded PDF as part of a request from my client, and I need to convert it to a FileInputStream to use with the PDFBox library. I am trying to achieve this without writing the file to disk, reading the base64 PDF directly into an input stream instead.
What I am able to do:
Convert the base64-encoded PDF to a File and write it to the drive
Read the file into a FileInputStream
What I want to do:
Convert the base64-encoded PDF to an input stream to use with PDFBox, avoiding writing the file to disk.
In PDFBox you can load a PDF from any InputStream; it does not need to be a FileInputStream.
Using the following code you can get an InputStream of the decoded PDF.
Note: depending on the scheme used to encode the PDF, you may need .getMimeDecoder() instead of .getDecoder().
InputStream decoded = java.util.Base64.getDecoder().wrap(inputStream);
PDDocument document = PDDocument.load(decoded);
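If the request hands you the payload as a base64 String rather than a stream, decoding to a byte array works just as well; a small sketch (base64String stands in for your request field):
import java.io.ByteArrayInputStream;
import java.util.Base64;
import org.apache.pdfbox.pdmodel.PDDocument;

// Decode the whole payload in memory and feed PDFBox a ByteArrayInputStream;
// nothing is written to disk.
byte[] pdfBytes = Base64.getDecoder().decode(base64String);
PDDocument document = PDDocument.load(new ByteArrayInputStream(pdfBytes));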

best approach to check if xml is part of a pdf document in Java

I want to check whether a PDF file contains a long string, which is the text of a full XML document.
I can already open both files and extract the text; I've done that with the following code:
File temp = File.createTempFile("temp-pdf", ".tmp");
OutputStream out = new FileOutputStream(temp);
out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
out.close();
PDDocument document = PDDocument.load(temp);
PDFTextStripper pdfStripper = new PDFTextStripper();
String pdfText = pdfStripper.getText(document);
Integer posS = pdfText.indexOf("<?xml version");
Integer posE = pdfText.lastIndexOf("</ServiceSpecificationSchema:serviceSpecification>") + "</ServiceSpecificationSchema:serviceSpecification>".length();
pdfText = pdfText.substring(posS, posE);
String xmlText = testObject.getXmlAsDoc().getContent();
Now I have the problem that the lines of the two documents don't match because of formatting, such as line breaks introduced by the PDF.
Example lines of TXT output from XML file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification xmlns:xs=" ..... >
Example lines of TXT output from PDF file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification
xmlns:xs=" ..... >
Second, I have page numbers between the XML tags from the PDF. Do you know a good way to remove these lines?
</operations>
Page 51 of 52
</consumerInterface>
What is the best approach to check whether the PDF contains the XML?
I've already tried removing all line breaks and whitespace from both files and comparing them, but if I do that, I can no longer pinpoint the line containing a difference.
The result does not have to be a valid XML file at the end.
Just want to post my solution in case others need it.
My code is a little too large to post here.
Basically, I extract the text from the PDF and remove strings like "Page x" and headlines from it. After that I remove all whitespace, as pointed out above. Finally, I compare the extracted string character by character to tell my users where they have gone wrong in the text. This method works pretty well, even if the author doesn't care about formatting and just copies and pastes the whole XML document.
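A minimal sketch of that idea, assuming the page markers always look like "Page x of y" as in the excerpt above (the regex and variable names are illustrative, building on pdfText and xmlText from the question's code):
// Drop "Page x of y" lines and all whitespace, then report the first mismatch.
String cleanedPdf = pdfText
        .replaceAll("(?m)^Page \\d+ of \\d+$", "")
        .replaceAll("\\s+", "");
String cleanedXml = xmlText.replaceAll("\\s+", "");
int limit = Math.min(cleanedPdf.length(), cleanedXml.length());
for (int i = 0; i < limit; i++) {
    if (cleanedPdf.charAt(i) != cleanedXml.charAt(i)) {
        System.out.println("First difference at character " + i
                + ": '" + cleanedPdf.charAt(i) + "' vs '" + cleanedXml.charAt(i) + "'");
        break;
    }
}
if (cleanedPdf.length() != cleanedXml.length()) {
    System.out.println("Lengths differ: " + cleanedPdf.length() + " vs " + cleanedXml.length());
}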

how to read utf-8 chars in opencsv

I am trying to read from csv file. The file contains UTF-8 characters. So based on Parse CSV file containing a Unicode character using OpenCSV and How read Japanese fields from CSV file into java beans? I just wrote
CSVReader reader = new CSVReader(new InputStreamReader(new FileInputStream("data.csv"), "UTF-8"), ';');
But it does not work. The text >>Sí, es nuevo<< displays correctly in Notepad, Excel, and various other text editors, but when I parse the file via opencsv I get >>S�, es nuevo<< (the í is a special character, if you were wondering ;)
What am I doing wrong?
You can use the encoding UTF-16LE; I used it when writing a file for Japanese.
Thanks aioobe. It turned out the file was not really UTF-8, despite most Windows programs showing it as such. Notepad++ was the only one that did not show the file as UTF-8 encoded, and after converting the data file the code works.
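When an editor and a parser disagree about a file's encoding like this, a quick sanity check (a sketch, not from the original answers) is to dump the raw bytes around the suspect character: in UTF-8, í is the two bytes 0xC3 0xAD, while in a single-byte encoding such as windows-1252 it is the single byte 0xED.
import java.nio.file.Files;
import java.nio.file.Paths;

// Print the file's raw bytes in hex; look for C3 AD (UTF-8) vs ED (windows-1252).
byte[] raw = Files.readAllBytes(Paths.get("data.csv"));
for (byte b : raw) {
    System.out.printf("%02X ", b);
}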
Use the code below for your issue; it might be helpful to you...
String value = URLEncoder.encode(msg[no], "UTF-8");
thanks,
Yash
Use ISO-8859-1, ISO-8859-14, ISO-8859-15, ISO-8859-10, ISO-8859-13, or ISO-8859-2 instead of UTF-8
