GSON / iText: Extract text from a PDF 1.7 byte[] in Java

I'm automating tests using Rest-Assured and GSON, and need to validate the contents of a PDF file that is returned in the response of a POST request. The content of the files varies and can contain anything from plain text, to text and tables, to text, tables, and graphics. Every page can, and most likely will, differ in glyph content. I am only concerned with ALL text on the PDF page, whether it is plain text, text inside a table, or text associated with (or inside) an image. Since all PDFs returned by the request are different, I cannot define search areas (as far as I know). I just need to extract all text on the page.
I extract the pdf data into a byte array like so:
Gson pdfGson = new Gson();
byte[] pdfBytes = pdfGson.fromJson(
        this.response.as(JsonObject.class)
                .get("pdfData").getAsJsonObject()
                .get("data").getAsJsonArray(),
        byte[].class);
(I've tried other extraction methods for the byte[], but this is the only way I've found that returns valid data.) This returns a very large byte[] like so:
[37, 91, 22, 45, 23, ...]
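The first value, 37, is the ASCII code for '%', which is consistent with raw PDF data: every PDF file begins with the header %PDF-. As a quick sanity check (a sketch only; it assumes pdfBytes holds the complete raw file bytes), you can inspect the first few bytes before handing them to iText:

// Peek at the file header; a well-formed PDF starts with "%PDF-" plus a version, e.g. "%PDF-1.7"
byte[] headerBytes = java.util.Arrays.copyOfRange(pdfBytes, 0, Math.min(8, pdfBytes.length));
String header = new String(headerBytes, java.nio.charset.StandardCharsets.US_ASCII);
System.out.println(header.startsWith("%PDF-") ? "Raw PDF header: " + header : "Unexpected header: " + header);

If the header is present, the array is already a complete PDF file and can likely be passed straight to a PdfReader without any manual decoding.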
When I parse the array I run into the same issue as This Question (except my PDF is 1.7), so I attempted to implement the accepted answer, adjusted for my purposes as explained in the iText documentation:
byte[] decodedPdfBytes = PdfReader.decodeBytes(pdfBytes, new PdfDictionary(),
        FilterHandlers.getDefaultFilterHandlers());
IRandomAccessSource source = new RandomAccessSourceFactory().createSource(decodedPdfBytes);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ReaderProperties readerProperties = new ReaderProperties();
// Ineffective:
readerProperties.setPassword(user.password.getBytes());
PdfReader pdfReader = new PdfReader(source, readerProperties);
// Ineffective:
pdfReader.setUnethicalReading(true);
PdfDocument pdfDoc = new PdfDocument(pdfReader, new PdfWriter(baos));
// Note: the bound must be <= getNumberOfPages(), otherwise the last page is skipped
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    String text = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i));
    System.out.println(text);
}
This DOES decode the PDF page and return text, but only the header text; no other text is returned.
For what it's worth, on the front end, when the user clicks the button to generate the PDF, it returns a blob containing the download data, so I'm relatively sure the metadata is GSA encoded, but I'm not sure whether that matters at all. I'm not able to share an example of the PDF docs due to sensitive material.
Any pointer in the right direction would be greatly appreciated! I've spent 3 days trying to find a solution.

For those looking for a solution: we ultimately went a different route and never found a solution to this specific issue.

Related

Problem with PDFTextStripper().getText when using PDType0Font in pdfbox

I've started to work with PDType0Font recently (we've used PDType1Font.HELVETICA but needed Unicode support) and I'm facing an issue where I'm adding lines to the file using PDPageContentStream, but PDFTextStripper.getText doesn't see the updated file contents.
I'm loading the font:
PDType0Font.load(document, fontFile)
And creating the contentStream as follows:
PDPageContentStream(document, pdPage, PDPageContentStream.AppendMode.PREPEND, false)
my function that adds content to the pdf is:
private fun addTextToContents(contentStream: PDPageContentStream, txtLines: List<String>, x: Float, y: Float,
                              pdfFont: PDFont, fontSize: Float, maxWidth: Float) {
    contentStream.beginText()
    contentStream.setFont(pdfFont, fontSize)
    contentStream.newLineAtOffset(x, y)
    txtLines.forEach { txt ->
        contentStream.showText(txt)
        contentStream.newLineAtOffset(0.0F, -fontSize)
    }
    contentStream.endText()
    contentStream.close()
}
When I try to read the content of the file using PDFTextStripper.getText, I get the file contents from before the changes.
However, if I call document.save before running PDFTextStripper, it works.
val txt: String = PDFTextStripper().getText(doc) //not working
doc.save(//File)
val txt: String = PDFTextStripper().getText(doc) //working
If I use PDType1Font.HELVETICA in
contentStream.setFont(pdfFont, fontSize)
everything works without any problems and without saving the doc before reading the text.
I suspect that the issue is with this code in PDPageContentStream.showTextInternal():
// Unicode code points to keep when subsetting
if (font.willBeSubset())
{
    int offset = 0;
    while (offset < text.length())
    {
        int codePoint = text.codePointAt(offset);
        font.addToSubset(codePoint);
        offset += Character.charCount(codePoint);
    }
}
This is the only part that differs between using PDType0Font with embedded subsets and PDType1Font.
Can someone help with this?
What am I doing wrong?
Your question, in particular the quoted code, already hints at the answer:
When using a font that will be subset (font.willBeSubset() == true), the associated PDF objects are unfinished until the file is saved. Text extraction, on the other hand, needs the finished PDF objects to work properly. Thus, don't apply text extraction to a document that is still being created and uses fonts that will be subset.
You describe your use case as
for our unit tests, we are adding text (mandatory text for us) to the document and then using PDFTextStripper we are validating that the file has the proper fields.
As Tilman proposes: "Then it would make more sense to save the PDF, and then to reload. That would be a more realistic test. Not saving is cutting corners IMHO."
Indeed, in unit tests you should first produce the final PDF as it will be sent out (i.e. save it, either to the file system or to memory), then reload that file, and test only the reloaded document.
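A minimal sketch of that pattern (assuming the PDFBox 2.x API, where PDDocument.load accepts a byte[], inside a test method that declares throws IOException):

// Produce the final PDF in memory; save() completes the font subsets
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
document.close();
// Reload the finished file and extract text from that document only
try (PDDocument reloaded = PDDocument.load(baos.toByteArray())) {
    String txt = new PDFTextStripper().getText(reloaded);
    // run the assertions against txt here
}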

How to add metadata to an image (with Java code) and then convert it to DICOM

I found Java code that combines a JPG file and a DICOM file (taking the metadata from the latter) into a final DICOM file. What I want to do is convert the JPG image into a DICOM one, generating the metadata in Java code.
BufferedImage jpg = ImageIO.read(new File("myjpg.jpg"));

// Convert the image to a byte array
DataBuffer buff = jpg.getData().getDataBuffer();
DataBufferUShort buffer = new DataBufferUShort(buff.getSize());
for (int i = 0; i < buffer.getSize(); ++i) {
    buffer.setElem(i, buff.getElem(i));
}
short[] data = buffer.getData();
ByteBuffer byteBuf = ByteBuffer.allocate(2 * data.length);
for (int i = 0; i < data.length; i++) {
    byteBuf.putShort(data[i]);
}

// Copy a header
DicomInputStream dis = new DicomInputStream(new File("fileToCopyheaderFrom.dcm"));
Attributes meta = dis.readFileMetaInformation();
Attributes attribs = dis.readDataset(-1, Tag.PixelData);
dis.close();

// Change the rows and columns
attribs.setInt(Tag.Rows, VR.US, jpg.getHeight());
attribs.setInt(Tag.Columns, VR.US, jpg.getWidth());
System.out.println(byteBuf.array().length);

// Write the file
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
DicomOutputStream dcmo = new DicomOutputStream(new File("myDicom.dcm"));
dcmo.writeFileMetaInformation(meta);
attribs.writeTo(dcmo);
dcmo.close();
I am not an expert in the toolkit (nor in Java, for that matter).
Your "// Copy a header" section reads the source DICOM file and holds all its attributes in the Attributes attribs variable.
Then your "// Change the rows and columns" section modifies a few attributes as needed.
Then your "// Write the file" section simply writes the attributes read from the source file to the destination file.
Now, you want to bypass the source DICOM file and convert a plain JPEG to DICOM, adding the attributes yourself.
Replace your "// Copy a header" section with code that builds the instance of Attributes itself:
Attributes attribs = new Attributes();
attribs.setString(Tag.StudyDate, VR.DA, "20110404");
attribs.setString(Tag.StudyTime, VR.TM, "15");
The tags mentioned in the above example are for illustration only. You have to decide yourself which tags you want to include. Note that the specifications define Types 1, 1C, 2, 2C and 3 for tags, depending on the SOP class you are dealing with.
While adding the tags, you also have to take care to use the correct VR. The specifications cover that as well.
I cannot explain all of this here; it is too broad.
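Still, as a rough starting point (a sketch only, assuming dcm4che3, whose UIDUtils helper generates new UIDs; the tag list is far from complete), a from-scratch header for the Secondary Capture SOP class could begin like this:

Attributes attribs = new Attributes();
// Secondary Capture is the simplest SOP class for images converted from other formats
attribs.setString(Tag.SOPClassUID, VR.UI, UID.SecondaryCaptureImageStorage);
// Identifying UIDs must be newly generated, never copied from another object
attribs.setString(Tag.StudyInstanceUID, VR.UI, UIDUtils.createUID());
attribs.setString(Tag.SeriesInstanceUID, VR.UI, UIDUtils.createUID());
attribs.setString(Tag.SOPInstanceUID, VR.UI, UIDUtils.createUID());
// Image pixel module attributes must match the actual pixel data
attribs.setInt(Tag.Rows, VR.US, jpg.getHeight());
attribs.setInt(Tag.Columns, VR.US, jpg.getWidth());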
I can't help with dcm4che, but if using another Java DICOM library is an option for you, this task is quite simple using DeCaMino (http://dicomplugin.com):
BufferedImage jpg = ImageIO.read(new File("myjpg.jpg"));
DicomWriter dw = new DicomWriter();
dw.setOutput(new File("myjpg.dcm"));
DicomMetadata dmd = new DicomMetadata();
dw.write(dmd, new IIOImage(jpg, null, null), null);
This will write a DICOM conformant file with SOP class "secondary capture" and default metadata.
To customize the metadata, add data elements to dmd before writing, e.g. :
DataSet ds = dmd.getDataSet();
ds.set(Tag.StudyDate, LocalDate.of(2011, 4, 4));
ds.set(Tag.StudyTime, LocalTime.of(15, 0, 0));
You can also change the transfer syntax (thus controlling the pixel data encoding):
dw.setTransferSyntax(UID.JPEG2000TS);
Disclaimer: I'm the author of DeCaMino.
EDIT: As kritzel_sw says, I strongly advise against modifying an existing DICOM object by changing the pixel data and some data elements; you'll mostly end up with a non-conformant object. It is better to write an object from scratch, and the simplest objects are those of the Secondary Capture class. DeCaMino helps you by generating a conformant Secondary Capture object with the mandatory data elements, but it won't help you generate a modality object (like a CT acquisition).
Just a side note:
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
VR.OW means 16 bits per pixel/channel. Since you are replacing the pixel data with pixel data read from a JPEG image, and you named the buffer "byteBuf", I suspect this is inconsistent. VR.OB is the value representation for 8-bits-per-pixel/channel images.
Talking about channels: I understand that you want to make constructing a DICOM object easy by modifying an existing DICOM image rather than creating a new one from scratch. However, color pixel data is not appropriate for all types of DICOM images. E.g. if your fileToCopyheaderFrom.dcm is a radiography, CT or MRI image (or one of many other radiology types), it is not allowed to contain color pixel data.
Furthermore, each image contains identifying information (Study, Series and SOP Instance UID are the most important ones) which should be replaced by newly generated values.
I understand that it appears appealing to modify an existing DICOM object with new pixel data, but this process is probably much more complicated than you would expect. In either case, learning basic DICOM concepts is unavoidable.

How to save each page of a PDF file in a byte[] and restore it back

I need to go through the pages of a PDF file and load each one separately into a byte[]. I use the iText library.
I load a file consisting of one page with this code:
public Document addPageInTheDocument(String namePage, MultipartFile pdfData, Long documentId) throws IOException {
    notNull(namePage, INVALID_PARAMETRE);
    notNull(pdfData, INVALID_PARAMETRE);
    notNull(documentId, INVALID_PARAMETRE);
    byte[] in = pdfData.getBytes(); // size of file: 88747
    Page page = new Page(namePage);
    Document document = new Document();
    document.setId(documentId);
    PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfData.getBytes()));
    PdfDocument pdfDocument = new PdfDocument(reader);
    if (pdfDocument.getNumberOfPages() != 1) {
        throw new IllegalArgumentException();
    }
    byte[] transform = pdfDocument.getPage(1).getContentBytes(); // size of page: 1907
    page.setPageData(pdfDocument.getPage(1).getContentBytes());
    return addPageInTheDocument(document, page);
}
I'm trying to restore the file with this code:
ByteBuffer byteContent = new ByteBuffer();
for (Map.Entry<String, Page> page : pages.entrySet()) {
    byteContent.append(page.getValue().getPageData());
}
PdfWriter writer = new PdfWriter(new FileOutputStream(book.getName() + modification + FORMAT));
byte[] df = byteContent.toByteArray();
PdfReader reader = new PdfReader(new ByteArrayInputStream(byteContent.toByteArray()));
com.itextpdf.layout.Document itextDocument = new com.itextpdf.layout.Document(new PdfDocument(reader, writer));
itextDocument.close();
Why is there such a difference in size?
And how can I restore the file from the byte[] of its pages?
Let's start with your size question:
byte[] in = pdfData.getBytes(); // size file 88747
...
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page
...
Why is there such a difference in size?
Because PdfPage.getContentBytes() does not return what you expect.
You seem to expect it to return a complete representation of the contents of the given page, and the Javadoc of that method ("Get decoded bytes for the whole page content.") might be interpreted to mean that.
This is not the case. PdfPage.getContentBytes() returns the contents of the content stream(s) of the page. These content streams contain a sequence of commands which build the page. But these commands take parameters which reference data outside the content stream, e.g.:
when text is drawn on a PDF page, the content stream contains an operation selecting a font, but the data describing the font (and, in the case of embedded fonts, the font program itself) is outside the content stream;
when bitmap images are drawn, the content stream usually contains an operation for it which references image data outside the content stream;
there are operations which reference so-called xobjects, which essentially are independent content streams that can be called upon from any page; these xobjects are not contained in the page content stream either.
Furthermore there are annotations (e.g. form fields) with their own content streams which are stored in separate structures. And lots of page properties are outside, too.
Thus, there is such a difference in size because you get only a minute part of the page definition using getContentBytes.
Now let's look at your code "restoring the file".
As a corollary of the above it is obvious that your code merely concatenates some content streams but does not provide the external resources these streams refer to.
But aside from that, your code also points out a misunderstanding concerning the nature of PDF pages: they are not merely blobs you can split and concatenate again as you like. They are collections of PDF objects which are spread throughout the PDF file; different pages can share some of their objects (e.g. fonts or often-used images).
What you can do instead...
As the representation of a single page you should use a PDF containing the data referenced by that single page. The iText example Burst.java shows how to do that.
To join these single-page PDFs again you can use an iText PdfMerger. Remember to set smart mode (PdfWriter.setSmartMode(true)) to prevent resource duplication in the result.
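A compact sketch of that approach (iText 7 assumed; copyPagesTo copies each page together with the resources it references, and in stands in for the original file's bytes as in the question):

// Split: store every page as a complete one-page PDF
List<byte[]> pagePdfs = new ArrayList<>();
PdfDocument src = new PdfDocument(new PdfReader(new ByteArrayInputStream(in)));
for (int i = 1; i <= src.getNumberOfPages(); i++) {
    ByteArrayOutputStream pageBaos = new ByteArrayOutputStream();
    PdfDocument single = new PdfDocument(new PdfWriter(pageBaos));
    src.copyPagesTo(i, i, single);
    single.close();
    pagePdfs.add(pageBaos.toByteArray());
}
src.close();

// Join: merge the one-page PDFs; smart mode de-duplicates shared resources
PdfWriter writer = new PdfWriter(new FileOutputStream("restored.pdf"));
writer.setSmartMode(true);
PdfDocument dest = new PdfDocument(writer);
PdfMerger merger = new PdfMerger(dest);
for (byte[] onePage : pagePdfs) {
    PdfDocument pageDoc = new PdfDocument(new PdfReader(new ByteArrayInputStream(onePage)));
    merger.merge(pageDoc, 1, 1);
    pageDoc.close();
}
dest.close();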

IO Issue - Byte Array Image into XHTML (FlyingSaucer)

I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as files, link each file in an img tag (or background-image) in a constructed report, print the report, and then delete the files. This seems really sloppy, and I imagine it will be very time-consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into HTML?
Read the image and convert it into its Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
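Since the images already come out of the SQL database as byte arrays, the stream-reading step can be skipped entirely. A JDK-only alternative (Java 8+, no Guava required; imageBytes is a hypothetical name for the byte[] column fetched from the database):

// Encode the raw image bytes directly with the JDK's Base64 codec
String encodedImage = java.util.Base64.getEncoder().encodeToString(imageBytes);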
Change src attribute of img element within your Document object.
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately FlyingSaucer does not support data URIs out of the box, so you'll have to create your own ReplacedElementFactory. Read the article "Using Data URLs for embedding images in Flying Saucer generated PDFs"; it contains a complete solution.

Wrong parsing with iText's PdfTextExtractor

I'm facing a problem when I try to read the content of a PDF document. I'm using iText 2.1.7 with Java, and I need to analyze the content of the document. At first I was using PdfTextExtractor's getTextFromPage method and it worked correctly, but only when the page is just text: if it contains an image, the String I get from getTextFromPage is a set of meaningless symbols (maybe a different character encoding?), and I lose the content of the whole page. I tried the latest version of iText and it works fine, but if I'm not mistaken its license isn't totally free (I'm working on a web application for a commercial customer, which serves PDFs on the fly), so I can't use it. I would really appreciate any suggestions.
In case you need it, here is the code:
PdfReader pdf = new PdfReader(doc); // doc is just a byte[]
int pageCount = pdf.getNumberOfPages();
for (int i = 1; i <= pageCount; i++) {
    PdfTextExtractor pdfTextExtractor = new PdfTextExtractor(pdf);
    String pageText = pdfTextExtractor.getTextFromPage(i);
}
Thanks in advance, regards.
I think that your PDF has an inline image. I do not think that iText 2.1.7 will deal with that.
You can find information regarding the license here.
