Why is PDFParser generating special characters instead of spaces? - java

The following code is generating special characters instead of spaces for one PDF but not another:
String fullText;
BodyContentHandler handler = null;
try {
    // size limit is 100 MB
    handler = new BodyContentHandler(100 * 1024 * 1024);
    Metadata meta = new Metadata();
    PDFParser parser = new PDFParser();
    // turns off Tika/PDFBox's heuristic insertion of spaces between words
    parser.setEnableAutoSpace(false);
    parser.parse(new FileInputStream(this.pdf /* always a valid pdf file */), handler, meta, new ParseContext());
} catch (SAXException e) {
    throw new IOException(e);
} catch (TikaException e) {
    throw new IOException(e);
}
fullText = handler.toString();
Depending on the PDF, a substring of fullText will look like:
will*continue*to*be*used*in*support*of*the
when it should look like this:
will continue to be used in support of the
In other places, '%' is substituted for '-' and '!' is substituted for spaces within bolded text.
This issue occurs only when processing one PDF, not the other. According to pdfinfo, both PDFs were generated by Quartz PDFContext.
The Linux command pdftotext produces the same results.
Is this a problem with how the original PDF is generated? Why is this happening?
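Since the snippet above disables auto-space explicitly, one quick experiment (a sketch, not a diagnosis) is to leave setEnableAutoSpace at its default of true, which lets Tika/PDFBox estimate word breaks from glyph positions. Note that this only affects space estimation; it does not change which characters the PDF's embedded fonts map to, so if the '*' glyphs are baked into the file, as the identical pdftotext output suggests, the output will stay the same:
BodyContentHandler handler = new BodyContentHandler(100 * 1024 * 1024);
Metadata meta = new Metadata();
PDFParser parser = new PDFParser();
// default behaviour: estimate where word breaks belong from glyph spacing
parser.setEnableAutoSpace(true);
parser.parse(new FileInputStream(this.pdf), handler, meta, new ParseContext());
String fullText = handler.toString();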

UTF-16LE encoding and xerces2 Java

I went through a few posts, such as "FileReader reads the file as a character stream and can be treated as whitespace if the document is handed as a stream of characters", where the answers say the input source is actually a char stream, not a byte stream.
However, the suggested solution from the first post does not seem to apply to UTF-16LE, although I use this code:
try (final InputStream is = Files.newInputStream(filename.toPath(), StandardOpenOption.READ)) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
I still get org.xml.sax.SAXParseException: Content is not allowed in prolog.
I looked at Files.newInputStream, and it indeed uses a ChannelInputStream, which hands over bytes, not chars. I also tried setting the encoding on the InputSource object, but with no luck.
I also checked that there are no extra chars (except the BOM) before the <?xml part.
I also want to mention that this code works just fine with UTF-8.
// Edit:
I also tried DocumentBuilderFactory.newInstance().newDocumentBuilder().parse() and XmlInputStreamReader.next(), same results.
// Edit 2:
Tried using a buffered reader. Same results:
Unexpected character '뿯' (code 49135 / 0xbfef) in prolog; expected '<'
Thanks in advance.
To get a bit further, here is some info gathering:
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (xml.startsWith("\uFEFF")) {
    LOG.info("Has BOM and is evidently UTF_16LE");
    xml = xml.substring(1);
}
if (!xml.contains("<?xml")) {
    LOG.info("Has no XML declaration");
}
String declaredEncoding = "UTF-8"; // fallback when nothing is declared
Matcher m = Pattern.compile("<\\?xml[^>]*encoding=[\"']([^\"']+)[\"']").matcher(xml);
if (m.find()) {
    declaredEncoding = m.group(1);
}
LOG.info("Declared as " + declaredEncoding);
try (final InputStream is = new ByteArrayInputStream(xml.getBytes(declaredEncoding))) {
    DOMParser parser = new org.apache.xerces.parsers.DOMParser();
    parser.parse(new InputSource(is));
    return parser.getDocument();
} catch (final SAXParseException saxEx) {
    LOG.debug("Unable to open [{}] as InputSource.", absolutePath, saxEx);
}
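For what it is worth, a minimal sketch of the char-stream idea mentioned above (assuming the file really is UTF-16LE with a leading BOM, as described): decode the bytes explicitly, strip the BOM character, and hand Xerces a character stream so it has nothing left to guess:
byte[] bytes = Files.readAllBytes(filename.toPath());
String xml = new String(bytes, StandardCharsets.UTF_16LE);
if (!xml.isEmpty() && xml.charAt(0) == '\uFEFF') {
    xml = xml.substring(1); // drop the BOM so it no longer ends up "in the prolog"
}
DOMParser parser = new org.apache.xerces.parsers.DOMParser();
parser.parse(new InputSource(new StringReader(xml)));
return parser.getDocument();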

How do I specify fonts for a specific encoding?

I am using the Flying Saucer library and trying to add a custom font for a specific encoding, so that I can support Unicode characters.
Here is the link to the solution that I follow from the official Flying Saucer user guide: http://flyingsaucerproject.github.io/flyingsaucer/r8/guide/users-guide-R8.html#xil_33.
Below is the code:
public void convertHtmlToPdf(String html, String css, OutputStream out) {
    try {
        html = correctHtml(html);
        html = getFormedHTMLWithCSS(html, css);
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode rootTagNode = cleaner.clean(html);
        CleanerProperties cleanerProperties = cleaner.getProperties();
        XmlSerializer xmlSerializer = new PrettyXmlSerializer(cleanerProperties);
        String cleanedHtml = xmlSerializer.getAsString(rootTagNode);
        File fontFile = new File("/Verdana.ttf");
        FontFactory.register(fontFile.getAbsolutePath());
        ITextRenderer r = new ITextRenderer();
        r.getFontResolver().addFont(fontFile.getAbsolutePath(), BaseFont.IDENTITY_H, BaseFont.NOT_EMBEDDED);
        r.setDocumentFromString(cleanedHtml);
        r.layout();
        r.createPDF(out);
        r.finishPDF();
    } catch (Exception e) {
        e.printStackTrace();
        logger.error(e.getMessage(), e);
    }
}
But I am still unable to render certain characters, such as '■' and '▲'.
For '■' I am getting &x25a0; in the generated PDF, and likewise for the other characters that I try to encode.
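For reference, a sketch of the two pieces Flying Saucer usually needs to show such glyphs (assumptions: the TTF actually contains the glyphs and exposes the family name "Verdana"): a registered font is only used when the CSS font-family of the affected elements resolves to it, and non-Latin glyphs normally have to be embedded:
ITextRenderer renderer = new ITextRenderer();
// BaseFont.EMBEDDED so the glyph outlines travel with the PDF
renderer.getFontResolver().addFont("/Verdana.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// The stylesheet handed to the renderer (the css argument above) must select
// that family for the text containing the special characters, for example:
String cssForGlyphs = "body { font-family: Verdana, sans-serif; }";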

How to write to a StyledDocument with a specific charset?

For a NetBeans plugin, I want to change the content of a file (which is open in the NetBeans editor) to a specific String with a specific charset. To achieve that, I open the file (a DataObject) with an EditorCookie and then change the content by inserting a different string into the StyledDocument of my data object.
However, I have a feeling that the file is always saved as UTF-8, even if I write a byte order mark into the file. Am I doing something wrong?
This is my code:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
NbDocument.runAtomic(cookie.getDocument(), () -> {
    try {
        StyledDocument document = cookie.openDocument();
        document.remove(0, document.getLength());
        document.insertString(0, utf16be, null);
        cookie.saveDocument();
    } catch (BadLocationException | IOException ex) {
        Exceptions.printStackTrace(ex);
    }
});
I have also tried this approach, which doesn't work either:
...
EditorCookie cookie = dataObject.getLookup().lookup(EditorCookie.class);
NbDocument.runAtomic(cookie.getDocument(), () -> {
    try {
        StyledDocument doc = cookie.openDocument();
        String utf16be = "\uFEFFHello World!";
        InputStream is = new ByteArrayInputStream(utf16be.getBytes(StandardCharsets.UTF_16BE));
        FileObject fileObject = dataObject.getPrimaryFile();
        String mimePath = fileObject.getMIMEType();
        Lookup lookup = MimeLookup.getLookup(MimePath.parse(mimePath));
        EditorKit kit = lookup.lookup(EditorKit.class);
        try {
            kit.read(is, doc, doc.getLength());
        } catch (IOException | BadLocationException ex) {
            Exceptions.printStackTrace(ex);
        } finally {
            is.close();
        }
        cookie.saveDocument();
    } catch (Exception ex) {
        Exceptions.printStackTrace(ex);
    }
});
Your problem is probably here:
String utf16be = new String("\uFEFFHello World!".getBytes(StandardCharsets.UTF_16BE));
This won't do what you think it does. It converts your string to a byte array using the UTF-16 big endian encoding and then creates a String from those bytes using the JRE's default encoding.
So, here's the catch:
A String has no encoding.
The fact that in Java it is a sequence of chars does not matter. Substitute 'carrier pigeons' for 'char'; the net effect will be the same.
If you want to write a String to a byte stream with a given encoding, you need to specify the encoding you need on the Writer object you create. Similarly, if you want to read a byte stream into a String using a given encoding, it is the Reader which you need to configure to use the encoding you want.
But your StyledDocument object's method is .insertString(): you should .insertString() your String object as is. Don't transform it the way you do; as explained above, that is misguided.
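As a minimal sketch of the Reader/Writer point above (the file name is hypothetical): the charset is chosen where characters are turned into bytes, never on the String itself:
String text = "Hello World!";
try (Writer w = new OutputStreamWriter(Files.newOutputStream(Paths.get("out.txt")), StandardCharsets.UTF_16BE)) {
    w.write('\uFEFF'); // optional BOM, written as an ordinary character
    w.write(text);
}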

Extract text from a large pdf with Tika

I am trying to extract text from a large PDF, but I only get the first pages; I need all of the text to end up in a String variable.
This is the code:
public class ParsePDF {
    public static void main(String args[]) throws Exception {
        try {
            File file = new File("C:/vlarge.pdf");
            String content = new Tika().parseToString(file);
            System.out.println("The Content: " + content);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
From the Javadocs:
To avoid unpredictable excess memory use, the returned string contains
only up to getMaxStringLength() first characters extracted from the
input document. Use the setMaxStringLength(int) method to adjust this
limitation.
Calling setMaxStringLength(-1) will disable this limit.
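A minimal sketch of that adjustment on the Tika facade used in the question (relying on the -1 behaviour noted above):
Tika tika = new Tika();
tika.setMaxStringLength(-1); // per the note above, -1 disables the limit
String content = tika.parseToString(new File("C:/vlarge.pdf"));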
Try the Apache Tika API directly; it works for large PDFs as well.
Sample:
InputStream input = new FileInputStream("sample.pdf");
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
new PDFParser().parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);

Special characters are not converted correctly from pdf to text

I have a set of PDF files that contain Central European characters such as č, Ď, Š and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some characters are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like that:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
InputStream is = null;
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: I forgot to mention that I am facing the same issue when converting to text with Acrobat Reader XI as well.
Well aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (as it can encode all of Unicode and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but going into the details of exception handling is a bit beyond the point of this answer.)
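A minimal sketch of that suggestion, reusing newname and outputString from the question's code; the try-with-resources takes care of the closing that would otherwise go in a finally block:
try (Writer out = new OutputStreamWriter(new FileOutputStream(newname), StandardCharsets.UTF_8)) {
    out.write(outputString);
}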
