I'm currently trying to automatically extract important keywords from a PDF file. I am able to get the text out of the PDF document, but now I need to know which font size and font family these keywords have.
Here is the code I already have:
Main
public static void main(String[] args) throws IOException {
String src = "SEM_081145.pdf";
PdfReader reader = new PdfReader(src);
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
// strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
}
out.flush();
out.close();
}
And I have implemented the text extraction strategy SemTextExtractionStrategy, which looks like this:
public class SemTextExtractionStrategy implements TextExtractionStrategy {
private String text;
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo renderInfo) {
text = renderInfo.getText();
System.out.println(renderInfo.getFont().getFontType());
System.out.print(text);
}
@Override
public void endTextBlock() {
}
@Override
public void renderImage(ImageRenderInfo renderInfo) {
}
@Override
public String getResultantText() {
return text;
}
}
I can get the font type, but there is no method to get the font size. Is there another way to get the font size of the current text segment?
Or are there other libraries which can extract the font size from text segments? I already had a look at PDFBox and PDFTextStream. The shareware PDF library from Aspose would do the job perfectly, but it's very expensive and I need to use an open-source project.
Thanks to Alexis I could convert his C# solution into Java code:
text = renderInfo.getText();
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
I had some trouble using Alexis' and Prine's solution, since it doesn't deal with rotated text correctly. So this is what I do (sorry, in Scala):
val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
val length1 = (x2.subtract(x1)).cross((x1.subtract(x0))).lengthSquared
val length2 = x2.subtract(x1).lengthSquared
(length1, length2) match {
case (0, 0) => 0
case _ => length1 / length2
}
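For reference, here is a minimal Java port of the Scala snippet above (a sketch against the iText 5 Vector API, intended to go inside renderText; note that the ratio is the squared distance between the ascent line and the baseline, so take the square root for the height itself):
Vector x0 = renderInfo.getAscentLine().getEndPoint();
Vector x1 = renderInfo.getBaseline().getStartPoint();
Vector x2 = renderInfo.getBaseline().getEndPoint();
// Squared perpendicular distance from the ascent-line end point to the baseline.
float length1 = x2.subtract(x1).cross(x1.subtract(x0)).lengthSquared();
float length2 = x2.subtract(x1).lengthSquared();
float heightSquared = (length1 == 0 && length2 == 0) ? 0 : length1 / length2;
float fontSize = (float) Math.sqrt(heightSquared);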
You can adapt the code provided in this answer, in particular this code snippet:
Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;
This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.
If you want the exact font size, use the following code in your renderText method:
float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
- renderInfo.getDescentLine().getStartPoint().get(1);
Modify this as indicated in the other answers for rotated text.
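Put into the renderText method of the strategy from the question, this could look roughly as follows (just a sketch; the output format is my own choice, and getPostscriptFontName() comes from iText's BaseFont):
@Override
public void renderText(TextRenderInfo renderInfo) {
    String text = renderInfo.getText();
    // Ascent minus descent approximates the font size for horizontal, unrotated text.
    float fontSize = renderInfo.getAscentLine().getStartPoint().get(1)
            - renderInfo.getDescentLine().getStartPoint().get(1);
    String fontName = renderInfo.getFont().getPostscriptFontName();
    System.out.println(text + " [" + fontName + ", " + fontSize + "]");
}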
Related
My goal is to transfer textual content from a PDF to a new PDF while preserving the font formatting (e.g. bold, italic, underlined).
I try to use the TextPosition list from the existing PDF and write a new PDF from it.
For this, I get the font and font size of the current entry from the TextPosition list and set them on a content stream to write the upcoming text through contentStream.showText().
After 137 successful loop iterations, this error follows:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(PDType1CFont.java:357)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:333)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:514)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:476)
at haupt.PageTest.printPdf(PageTest.java:294)
at haupt.MyTestPDF.main(MyTestPDF.java:54)
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
createDirectory();
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
input.createNewFile();
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
document.addPage(newPage);
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
contentStream.beginText();
pageIndex++;
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
contentStream.showText(pdfFileInText);
charIDindex++;
}
}
contentStream.endText();
contentStream.close();
}
} else {
System.out.println("pdf Encrypted");
}
}
}
MyPdfTextStripper:
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
super();
setSortByPosition(true);
}
@Override
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
}
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
setStartPage(pageNr);
setEndPage(pageNr);
getText(doc);
byPageList.add(List.copyOf(getCharactersByArticle()));
}
return byPageList;
}
}
Additional Info:
There are seven fonts in my document, all of which are embedded as subsets.
I need to write the given text with the corresponding given font.
All glyphs that should be written already exist in the original document, from which I get my TextPosition list.
All fonts are of subtype Type 1 or Type 0.
There is no AcroForm defined.
Thanks in advance
Edit 30.08.2022:
Fixed the issue by manually replacing this particular Unicode character with a placeholder in the String before trying to write it.
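A more general variant of that workaround is to probe each character against the font before writing it. Here is a minimal sketch (assuming PDFBox 2.x, where PDFont.encode(String) throws an IllegalArgumentException for glyphs missing from the subset; the helper name is made up):
private static String keepOnlyEncodable(PDFont font, String text) {
    StringBuilder sb = new StringBuilder();
    for (int offset = 0; offset < text.length(); ) {
        int codePoint = text.codePointAt(offset);
        String ch = new String(Character.toChars(codePoint));
        try {
            font.encode(ch);      // throws if the subset has no glyph for this character
            sb.append(ch);
        } catch (IllegalArgumentException | IOException e) {
            sb.append(' ');       // placeholder for characters the font cannot encode
        }
        offset += Character.charCount(codePoint);
    }
    return sb.toString();
}
With that, contentStream.showText(keepOnlyEncodable(tpFont, pdfFileInText)) skips the problematic characters instead of aborting with the exception above.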
Now I ran into this open ToDo:
org.apache.pdfbox.pdmodel.font.PDCIDFontType0.encode(int)
@Override
public byte[] encode(int unicode)
{
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
}
Anyone got any suggestions or workarounds for this?
Edit 01.09.2022:
I tried to replace occurrences of that font with an alternative font from the source file, but this opens another problem where a COSStream is "randomly" closed, which results in the new document not being able to save the file after writing my text with a content stream.
Using standard fonts like PDType1Font.HELVETICA instead works, though.
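A minimal sketch of that fallback, using the variable names from the code above (PDFBox 2.x assumed; whether HELVETICA can actually encode the given text depends on its characters):
try {
    tpFont.encode(pdfFileInText);                 // probe the original subset font
    contentStream.setFont(tpFont, tpFontSize);
} catch (IllegalArgumentException | UnsupportedOperationException e) {
    // Missing glyph or unimplemented encode(): fall back to a standard 14 font at the same size.
    contentStream.setFont(PDType1Font.HELVETICA, tpFontSize);
}
contentStream.showText(pdfFileInText);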
I have a question about using iText 7. I want to generate a PDF containing a QR code.
There is no problem with normal creation. However, I want to adjust the size of the dots of the QR code.
I don't want to reduce the overall size of the QR code, I just want to reduce the size of the dots in it.
Currently, I can only adjust the overall size of the QR code.
public class PdfTest {
public static final String DEST = "D:\\output.pdf";
public static final String SRC = "D:\\202106003196.pdf";
static PdfDocument mPdfDocument;
public static void main(String[] args) {
try {
File file = new File(DEST);
file.getParentFile().mkdirs();
FontProgramFactory.registerFont("D:\\font\\MAISONNEUEEXT-MEDIUM.OTF", "MAISONNEUEEXT-MEDIUM");
PdfFont mPdfFont_MAISONNEUEEX_MEDIUM = PdfFontFactory.createRegisteredFont("MAISONNEUEEXT-MEDIUM");
mPdfDocument = new PdfDocument(new PdfReader(SRC), new PdfWriter(DEST));
Document mDocument = new Document(mPdfDocument);
// QR
addQR("HTTP://QR.TEST.COM/0123456", mPdfDocument, 1, 25, 50);
PdfPage pdfPage = mPdfDocument.getPage(1);
Rectangle pageSize = pdfPage.getPageSize();
Rectangle[] mRectangle = {new Rectangle(0, 10, pageSize.getRight(), 50)};
mDocument.setRenderer(new ColumnDocumentRenderer(mDocument, mRectangle));
mDocument.close();
mPdfDocument.close();
} catch (Exception e) {
e.printStackTrace();
}
}
private static void addQR(String mQRCode, PdfDocument mPdfDocument, int mPage, float mX, float mY) {
PdfPage mPdfPage = mPdfDocument.getPage(mPage);
BarcodeQRCode mBarcodeQRCode;
Map<EncodeHintType, Object> mHints = new HashMap<>();
mHints.put(EncodeHintType.ERROR_CORRECTION, ErrorCorrectionLevel.M);
mHints.put(EncodeHintType.MIN_VERSION_NR, 2);
mBarcodeQRCode = new BarcodeQRCode(mQRCode, mHints);
PdfCanvas over = new PdfCanvas(mPdfPage);
mBarcodeQRCode.placeBarcode(over, ColorConstants.RED, 0.5f);
}
}
The size of the dots is defined by the QR algorithm itself and depends on the length of the content. Let me modify your code to demonstrate it.
In the snippet below we will add 10 QR codes: the first one will encode just one "Lorem ipsum" sentence, the last one ten sentences.
@Test
public void barcodesTest() throws FileNotFoundException {
PdfDocument pdfDocument = new PdfDocument(new PdfWriter(destinationFolder + "barcodes.pdf"));
String text = "Lorem ipsum dolor sit amet";
StringBuilder builder = new StringBuilder();
for (int i = 0; i < 10; i++) {
pdfDocument.addNewPage();
builder.append(text);
addQR(builder.toString(), pdfDocument, i+1);
}
pdfDocument.close();
}
private static void addQR(String text, PdfDocument pdfDoc, int pageNumber) {
PdfPage page = pdfDoc.getPage(pageNumber);
Map<EncodeHintType, Object> mHints = new HashMap<>();
mHints.put(EncodeHintType.ERROR_CORRECTION, ErrorCorrectionLevel.M);
mHints.put(EncodeHintType.MIN_VERSION_NR, 2);
BarcodeQRCode qrCode = new BarcodeQRCode(text, mHints);
PdfCanvas over = new PdfCanvas(page);
// 12 is passed to make the qr code appear big
qrCode.placeBarcode(over, ColorConstants.RED, 12);
}
If we compare the barcodes, we will see that the size of the dots is reduced gradually page by page (in fact, the barcode on the last page is so huge that it occupies far more space than the page itself):
[Screenshot: the QR code on the first page]
[Screenshot: the QR code on the last page]
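If the goal is to keep the printed footprint constant while the dots get smaller, the module side can be derived from the matrix size instead of being hard-coded. A sketch (assuming BarcodeQRCode.getBarcodeSize() reports the matrix dimensions at a module side of 1; the 100 pt target width is just an example value):
float targetWidth = 100f;                                   // desired overall width in points
BarcodeQRCode qrCode = new BarcodeQRCode(text, mHints);
float modulesAcross = qrCode.getBarcodeSize().getWidth();   // number of modules (dots) per row
float moduleSide = targetWidth / modulesAcross;             // more modules -> smaller dots
qrCode.placeBarcode(over, ColorConstants.RED, moduleSide);
This way a longer content (or a higher EncodeHintType.MIN_VERSION_NR) produces more, and therefore smaller, dots inside the same overall area.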
Does anybody know how to use a wavelet transform (Haar or Daubechies) from BoofCV in Java?
There's no code example on the web and I'm very new to this library.
I've tried the code below, but so far with no results.
public static void main(String[] args) throws IOException, InstantiationException, IllegalAccessException {
String dir = "C:\\Users\\user\\Desktop\\ImgTest\\";
BufferedImage buff = UtilImageIO.loadImage(dir + "img.png");
ImageUInt16 aux = UtilImageIO.loadImage(dir + "img.png", ImageUInt16.class);
ImageSInt16 input = new ImageSInt16(aux.width, aux.height);
input = ConvertImage.convert(aux, input);
ImageSInt32 output = new ImageSInt32(aux.width, aux.height);
int minPixelValue = 0;
int maxPixelValue = 255;
WaveletTransform<ImageSInt16, ImageSInt32, WlCoef_I32> transform =
FactoryWaveletTransform.create_I(FactoryWaveletHaar.generate(true, 0), 1,
minPixelValue, maxPixelValue,ImageSInt16.class);
output=transform.transform(input, output);
//BufferedImage out = ConvertBufferedImage.convertTo(output, null);
BufferedImage out = new BufferedImage(input.width*6, input.height*6, 5);
out = VisualizeBinaryData.renderLabeled(output, 0, out);
ImageIO.write(out, "PNG", new File(dir+"out.png"));
}
I also tried JWave, but there's no code example either, and the person responsible wasn't able to help with it.
Thank you!
There are a lot of code examples out there. I made one before, but in C#, not Java. For Java code, you could refer to the example here.
I have a PDF containing 2 blank images. I need to replace both images with 2 separate images using PDFBox. The problem is that both blank images appear to share the same resource, so if I replace one, the other is replaced with the same image as well.
I followed this example and tried overriding the processOperator() method, replacing the images based on the image height. However, it still ends up replacing both images with the same image. This is my code thus far:
protected void processOperator( PDFOperator operator, List arguments ) throws IOException
{
String operation = operator.getOperation();
if( INVOKE_OPERATOR.equals(operation) )
{
COSName objectName = (COSName)arguments.get( 0 );
Map<String, PDXObject> xobjects = getResources().getXObjects();
PDXObject xobject = (PDXObject)xobjects.get( objectName.getName() );
if( xobject instanceof PDXObjectImage )
{
PDXObjectImage blankImage = (PDXObjectImage)xobject;
int imageWidth = blankImage.getWidth();
int imageHeight = blankImage.getHeight();
System.out.println("Image width >>> "+imageWidth+" height >>>> "+imageHeight);
// Check if it is blank image 1 based on height
if(imageHeight < 480){
File logo = new File("abc.jpg");
BufferedImage bufferedImage = ImageIO.read(logo);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write( bufferedImage, "jpg", baos );
baos.flush();
byte[] logoImageInBytes = baos.toByteArray();
baos.close();
// label will be used to replace the blank image
label = logoImageInBytes;
}
BufferedImage img = ImageIO.read(new ByteArrayInputStream(label));
BufferedImage resizedImage = Scalr.resize(img, Scalr.Method.BALANCED, Scalr.Mode.FIT_EXACT, img.getWidth(), img.getHeight());
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ImageIO.write(resizedImage, "jpg", baos);
// Replace empty image in template with the image generated from shipping label byte array
PDXObjectImage validImage = new PDJpeg(doc, new ByteArrayInputStream(baos.toByteArray()));
blankImage.getCOSStream().replaceWithStream(validImage.getCOSStream());
}
}
}
Now, when I remove the if block which checks imageHeight < 480, it prints the image heights as 30 and 470 for the blank images. However, when I add the if block, it prints the image heights as 480 and 1500 and never enters the if block, because of which both blank images end up getting replaced by the same image.
What's going on here? I'm new to PDFBox, so I am unsure whether my code is correct.
While first thinking about a generic way to actually replace the existing image with the new images, I agree with @TilmanHausherr that a simpler solution would be to add an extra content stream drawing two images at the size and position you need, covering the existing image.
This approach is easier to implement (even generically) and less error-prone than actual replacement.
In a generic solution we do not have the image positions beforehand. To determine them, we can use this helper class (which essentially is a rip-off of the PDFBox example PrintImageLocations):
public class ImageLocator extends PDFStreamEngine
{
private static final String INVOKE_OPERATOR = "Do";
public ImageLocator() throws IOException
{
super(ResourceLoader.loadProperties("org/apache/pdfbox/resources/PDFTextStripper.properties", true));
}
public List<ImageLocation> getLocations()
{
return new ArrayList<ImageLocation>(locations);
}
protected void processOperator(PDFOperator operator, List<COSBase> arguments) throws IOException
{
String operation = operator.getOperation();
if (INVOKE_OPERATOR.equals(operation))
{
COSName objectName = (COSName) arguments.get(0);
Map<String, PDXObject> xobjects = getResources().getXObjects();
PDXObject xobject = (PDXObject) xobjects.get(objectName.getName());
if (xobject instanceof PDXObjectImage)
{
PDXObjectImage image = (PDXObjectImage) xobject;
PDPage page = getCurrentPage();
Matrix matrix = getGraphicsState().getCurrentTransformationMatrix();
locations.add(new ImageLocation(page, matrix, image));
}
else if (xobject instanceof PDXObjectForm)
{
// save the graphics state
getGraphicsStack().push((PDGraphicsState) getGraphicsState().clone());
PDPage page = getCurrentPage();
PDXObjectForm form = (PDXObjectForm) xobject;
COSStream invoke = (COSStream) form.getCOSObject();
PDResources pdResources = form.getResources();
if (pdResources == null)
{
pdResources = page.findResources();
}
// if there is an optional form matrix, we have to
// map the form space to the user space
Matrix matrix = form.getMatrix();
if (matrix != null)
{
Matrix xobjectCTM = matrix.multiply(getGraphicsState().getCurrentTransformationMatrix());
getGraphicsState().setCurrentTransformationMatrix(xobjectCTM);
}
processSubStream(page, pdResources, invoke);
// restore the graphics state
setGraphicsState((PDGraphicsState) getGraphicsStack().pop());
}
}
else
{
super.processOperator(operator, arguments);
}
}
public class ImageLocation
{
public ImageLocation(PDPage page, Matrix matrix, PDXObjectImage image)
{
this.page = page;
this.matrix = matrix;
this.image = image;
}
public PDPage getPage()
{
return page;
}
public Matrix getMatrix()
{
return matrix;
}
public PDXObjectImage getImage()
{
return image;
}
final PDPage page;
final Matrix matrix;
final PDXObjectImage image;
}
final List<ImageLocation> locations = new ArrayList<ImageLocation>();
}
(ImageLocator.java)
In contrast to the example class, this helper stores the locations in a list instead of printing them.
We can now cover the existing images using code like this:
try ( InputStream resource = getClass().getResourceAsStream("sample.pdf");
InputStream left = getClass().getResourceAsStream("left.png");
InputStream right = getClass().getResourceAsStream("right.png");
PDDocument document = PDDocument.load(resource) )
{
if (document.isEncrypted())
{
document.decrypt("");
}
PDJpeg leftImage = new PDJpeg(document, ImageIO.read(left));
PDJpeg rightImage = new PDJpeg(document, ImageIO.read(right));
// Locate images
ImageLocator locator = new ImageLocator();
List<?> allPages = document.getDocumentCatalog().getAllPages();
for (int i = 0; i < allPages.size(); i++)
{
PDPage page = (PDPage) allPages.get(i);
locator.processStream(page, page.findResources(), page.getContents().getStream());
}
// cover images
for (ImageLocation location : locator.getLocations())
{
// Decide on a replacement
PDRectangle cropBox = location.getPage().findCropBox();
float center = (cropBox.getLowerLeftX() + cropBox.getUpperRightX()) / 2.0f;
PDJpeg image = location.getMatrix().getXPosition() < center ? leftImage : rightImage;
AffineTransform transform = location.getMatrix().createAffineTransform();
PDPageContentStream content = new PDPageContentStream(document, location.getPage(), true, false, true);
content.drawXObject(image, transform);
content.close();
}
document.save(new File(RESULT_FOLDER, "sample-changed.pdf"));
}
(OverwriteImage)
This sample covers all images on the left half of their respective page with left.png and all others with right.png.
I have no implementation or example, but I want to outline a possible way to do what you want with the following steps:
1. Since you need 2 images (let's call them imageA and imageB) in the PDF instead of the 1 blank one, you have to add both of them to the PDF.
2. Save the file temporarily - optional, it could work without rewriting the PDF.
3. Reopen the file - optional; if you don't need step 2, you also don't need this step.
4. Then replace the blank image with imageA or imageB.
5. Remove the blank image from the PDF.
6. Save the PDF.
This question already has answers here: Writing Arabic with PDFBOX with correct characters presentation form without being separated.
Update 1
I'm trying to write some Arabic characters in a PDF document using PDFBox. As a result, I get some strange characters. Below you can find the code snippet I used for my test. Note that the same code printed Latin characters without any problem.
public static void main(String[] args) throws Exception {
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4);
document.addPage(page);
PDPageContentStream stream = new PDPageContentStream(document, page,true, true);
//Use of a unicode font
PDFont font = PDTrueTypeFont.loadTTF(document,"C:/arialuni.ttf");
font.setFontEncoding(new WinAnsiEncoding());
stream.setFont(font, 12);
stream.beginText();
stream.moveTextPositionByAmount(40, 600);
stream.drawString("سي ججس ححسيب حسججسيبنم حح ");
stream.endText();
stream.close();
document.save("c:\\resultpdf.pdf");
document.close();
}
Thanks for your help. I tried a Unicode font downloaded from the Microsoft website, but I still have the same result.
Update 2
By using the method 'drawUnicodeString' and the method 'loadTTF' that I got from PDFBOX-922,
I was able to write Arabic characters, but they are disconnected and ordered from left to right. Here are the two methods 'drawUnicodeString' and 'loadTTF':
public void drawUnicodeString(String text) throws IOException {
COSString string = new COSString();
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
string.append(c >> 8);
string.append(c & 0xff);
}
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
string.writePDF(buffer);
appendRawCommands(buffer.toByteArray());
appendRawCommands(32);
appendRawCommands(getISOBytes("Tj\n"));
}
public static PDType0Font loadTTF(PDDocument doc, InputStream is)
throws IOException {
/* Load the font which we will convert to Type0 font. */
PDTrueTypeFont pdTtf = PDTrueTypeFont.loadTTF(doc, is);
TrueTypeFont ttf = pdTtf.getTTFFont();
CMAPEncodingEntry unicodeMap = null;
for (CMAPEncodingEntry candidate : ttf.getCMAP().getCmaps()) {
if (candidate.getPlatformId() == CMAPTable.PLATFORM_WINDOWS
&& candidate.getPlatformEncodingId() == CMAPTable.ENCODING_UNICODE) {
unicodeMap = candidate;
break;
}
}
if (unicodeMap == null) {
throw new RuntimeException(
"To use as CIDFont, the TTF must have a Windows platform Unicode encoding");
}
float scaling = 1000f / ttf.getHeader().getUnitsPerEm();
MyPDCIDFontType2Font pdCidFont2 = new MyPDCIDFontType2Font();
pdCidFont2.setBaseFont(pdTtf.getBaseFont());
pdCidFont2.setFontDescriptor((PDFontDescriptorDictionary) pdTtf
.getFontDescriptor());
/* Fixme -- should determine the minimum and maximum charcode in the map */
int[] cid2gid = new int[65536];
List<Float> widths = new ArrayList<Float>();
int[] widthValues = ttf.getHorizontalMetrics().getAdvanceWidth();
for (int i = 0; i < cid2gid.length; i++) {
int glyph = unicodeMap.getGlyphId(i);
cid2gid[i] = glyph;
widths.add((float) i);
widths.add((float) i);
widths.add(widthValues[glyph] * scaling);
}
pdCidFont2.setCidToGid(cid2gid);
pdCidFont2.setWidths(widths);
pdCidFont2.setDefaultWidth(widths.get(0).longValue());
/* Now construct the type0 font that we actually return */
myType0Font pdFont0 = new myType0Font();
pdFont0.setDescendantFont(pdCidFont2);
pdFont0.setDescendantFonts(new COSObject(pdCidFont2.getCOSObject()));
pdFont0.setEncoding(COSName.IDENTITY_H);
pdFont0.setBaseFont(pdTtf.getBaseFont());
// pdfont0.setToUnicode(COSName.IDENTITY_H); XXX how to express identity
// mapping as ToUnicode program? */
return pdFont0;
}
Here are the characters that were printed:
I don't know why these characters are disconnected.
Arabic can be written by applying both PDFBOX-922 and PDFBOX-1287 (the diff files are attached to the issues' descriptions).
I hope that the patches will be applied in version 2.0.
I suggest you try adding the ICU4J jars to your project:
ICU4J
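To address the disconnected, left-to-right output described above, ICU4J can shape and reorder the string before it is handed to drawUnicodeString. A minimal sketch (assuming the ICU4J classes com.ibm.icu.text.ArabicShaping and com.ibm.icu.text.Bidi; this only prepares the string, the PDFBox patches are still needed on the font side):
import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

public static String shapeArabic(String text) throws ArabicShapingException {
    // Replace isolated letter forms with their contextual (joined) presentation forms.
    ArabicShaping shaper = new ArabicShaping(ArabicShaping.LETTERS_SHAPE);
    String shaped = shaper.shape(text);
    // Reorder the logical right-to-left text into visual order for left-to-right drawing.
    Bidi bidi = new Bidi(shaped, Bidi.DIRECTION_RIGHT_TO_LEFT);
    return bidi.writeReordered(Bidi.DO_MIRRORING);
}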