I'm automating tests using Rest-Assured and GSON - and need to validate the contents of a PDF file that is returned in the response of a POST request. The content of the files vary and can contain anything from just text, to text and tables, or text and tables and graphics. Every page can, and most likely will be different as far a glyph content. I am only concerned with ALL text on the pdf page - be it just plain text, or text inside of a table, or text associated with (or is inside of) an image. Since all pdf's returned by the request are different, I cannot define search areas (as far as I know). I just need to extract all text on the page.
I extract the pdf data into a byte array like so:
Gson pdfGson = new Gson();
byte[] pdfBytes =
pdfGson.fromJson(this.response.as(JsonObject.class)
.get("pdfData").getAsJsonObject().get("data").getAsJsonArray(), byte[].class);
(I've tried other extraction methods for the byte[], but this is the only way I've found that returns valid data.) This returns a very large byte[] like so:
[37, 91, 22, 45, 23, ...]
When I parse the array I run into the same issue as This Question (except my pdf is 1.7) and I attempt to implement the accepted answer, adjusted for my purposes and as explained in the documentation for iText:
byte[] decodedPdfBytes = PdfReader.decodeBytes(pdfBytes, new PdfDictionary(), FilterHandlers.getDefaultFilterHandlers());
IRandomAccessSource source = new RandomAccessSourceFactory().createSource(decodedPdfBytes);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ReaderProperties readerProperties = new ReaderProperties();
// Ineffective:
readerProperties.setPassword(user.password.getBytes());
PdfReader pdfReader = new PdfReader(source, readerProperties);
// Ineffective:
pdfReader.setUnethicalReading(true);
PdfDocument pdfDoc = new PdfDocument(pdfReader, new PdfWriter(baos));
for(int i = 1; i < pdfDoc.getNumberOfPages(); i++) {
String text = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i));
System.out.println(text);
}
This DOES decode the pdf page, and return text, but it is only the header text. No other text is returned.
For what it's worth, on the front end, when the user clicks the button to generate the pdf, it returns a blob containing the download data, so I'm relatively sure that the metadata is GSA encoded, but I'm not sure if that matters at all. I'm not able to share an example of the pdf docs due to sensitive material.
Any point in the right direction would be greatly appreciated! I've spent 3 days trying to find a solution.
For those looking for a solution - ultimately we wound up going a different route. We never found a solution to this specific issue.
I found a java code that converts a jpg and a Dicom(it takes the metadata friĀ”om that one) files to a final Dicom one. What I want to do is convert the jpg image into a Dicom one, generating the metadata with java code.
BufferedImage jpg = ImageIO.read(new File("myjpg.jpg"));
// Convert the image to a byte array
DataBuffer buff = jpg.getData().getDataBuffer();
DataBufferUShort buffer = new DataBufferUShort(buff.getSize());
for (int i = 0; i < buffer.getSize(); ++i)
buffer.setElem(i, buff.getElem(i));
short[] data = buffer.getData();
ByteBuffer byteBuf = ByteBuffer.allocate(2 * data.length);
int i = 0;
while (data.length > i) {
byteBuf.putShort(data[i]);
i++;
}
// Copy a header
DicomInputStream dis = new DicomInputStream(new File("fileToCopyheaderFrom.dcm"));
Attributes meta = dis.readFileMetaInformation();
Attributes attribs = dis.readDataset(-1, Tag.PixelData);
dis.close();
// Change the rows and columns
attribs.setInt(Tag.Rows, VR.US, jpg.getHeight());
attribs.setInt(Tag.Columns, VR.US, jpg.getWidth());
System.out.println(byteBuf.array().length);
// Write the file
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
DicomOutputStream dcmo = new DicomOutputStream(new File("myDicom.dcm"));
dcmo.writeFileMetaInformation(meta);
attribs.writeTo(dcmo);
dcmo.close();
I am not expert in toolkit (and of-course Java as well).
Your "// Copy a header" section reads the source DICOM file and holds all the attributes in Attributes attribs variable.
Then, your "// Change the rows and columns" section modifies few attributes as per need.
Then, your "// Write the file" section simply add the attributes read from source file to destination file.
Now, you want to bypass the source DICOM file and convert plain JPEG to DICOM with adding attributes yourself.
Replace your "// Copy a header" section to build the instance of Attributes.
Attributes attribs = new Attributes();
attribs.setString(Tag.StudyDate, VR.DA, "20110404");
attribs.setString(Tag.StudyTime, VR.TM, "15");
The tags mentioned in above example are for example only. You have to decide yourself which tags you want to include. Note that specifications have defined Types 1, 1C, 2, 2C and 3 for tags depending on the SOP class you are dealing with.
While adding the tags, you have to take care of correct VR as well. Specifications talk about that thing as well.
I cannot explain all this here; too broad.
I cannot help about dcm4che, but if using another Java DICOM library is an option for you, this task is quite simple using DeCaMino (http://dicomplugin.com) :
BufferedImage jpg = ImageIO.read(new File("myjpg.jpg"));
DicomWriter dw = new DicomWriter();
dw.setOutput(new File("myjpg.dcm"));
DicomMetadata dmd = new DicomMetadata();
dw.write(dmd, new IIOImage(jpg, null, null), null);
This will write a DICOM conform file with SOP class "secondary capture" and default metadata.
To customize the metadata, add data elements to dmd before writing, e.g. :
DataSet ds = dmd.getDataSet();
ds.set(Tag.StudyDate, LocalDate.of(2011, 4, 4));
ds.set(Tag.StudyTime, LocalTime.of(15, 0, 0));
You can also change the transfer syntax (thus controlling the pixel data encoding) :
dw.setTransferSyntax(UID.JPEG2000TS);
Disclaimer: I'm the author of DeCaMino.
EDIT: As kritzel_sw says, I'll strongly advice against modifying and existing DICOM object by changing pixel data and some data element, you'll mostly end with a non-conform object. Better is to write an object from scratch, and the simplest objects are from the secondary capture class. DeCaMino helps you by generating a conform secondary capture object with mandatory data elements, but it won't help you to generate a modality (like a CT acquisition) object.
Just a side note:
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
VR.OW means 16 bits per pixel/channel. Since you are replacing the pixel data with pixel data read from a JPEG image, and you named the buffer "byteBuf", I suspect that this is inconsistent. VR.OB is the value representation for 8 bits per pixel/channel image.
Talking about channels, I understand that you want to make construction of a DICOM object easy by modifying an existing DICOM image rather than creating a new one from scratch. However, color pixel data is not appropriate for all types of DICOM images. E.g. if your fileToCopyheaderFrom.dcm is a Radiography, CT or MRI image (or many other radiology types), it is not allowed to add color pixel data to it.
Furthermore, each image contains identifying information (Study-, Series-, SOP Instance UID are the most important ones) which should be replaced by newly generated values.
I understand that it appears appealing to modify an existing DICOM object with new pixel data, but this process is probably much more complicated than you would expect it to be. In both cases, it is inevitable to learn basic DICOM concepts.
PDF2Dom (based on the PDFBox library) is capable of converting PDFs to HTML format preserving such characteristics like font size, boldness etc. Example of this conversation is shown below:
private void generateHTMLFromPDF(String filename) {
PDDocument pdf = PDDocument.load(new File(filename));
Writer output = new PrintWriter("src/output/pdf.html", "utf-8");
new PDFDomTree().writeText(pdf, output);
output.close();}
I'm trying to parse an existing PDF and extract these characteristics on a line for line basis and I wonder if there are any existing methods within the PDF2Dom/PDFBox parse these right from the PDF?
Another method would be to just use the HTML output and proceed from there but it seems like an unnecessary detour.
From my Java code I am calling a Document Service to get a document from it based on some input parameters. The mime type of the document could be pdf, text or image. The service response is in the form of List<PageContent> where each PageContent has byte[] which represent one page of the document. Now I want to create an input stream for the whole document.
So I thought of collating all the pages like following:
List<PageContent> pages = ....//Response from the service
ByteArrayOutputStream bos = new ByteArrayOutputStream();
for(PageContent page : pages) {
byte[] data = page.getData();
bos.write(data);
}
InputStream is = new ByteArrayInputStream(bos.toByteArray());
I still have doubt that it is the right way or not. Should the logic of collating all page to form a single input stream be different based on mime type?
What kind of details do I need to get from Document Service provider which will help me do this?
So I am generating a 500+ page document by concatenating 250 individual pdfs of 2 pages each. Each pdf is a form which has fields that data is placed into.
I would like to stream content to the browser two pages at a time.
Currently I am waiting for the entire document to be generated before sending anything to the browser and this is causing some timeout problems.
This is a simplified form of what I am doing:
PdfCopyFields pdfCopier = new PdfCopyFields(outputStream);
for(Receipt receipt: receipts) {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
generator.generateSaleDocument(receipt.getId(), baos);
pdfCopier.addDocument(new PdfReader(baos.toByteArray()));
}
pdfCopier.close();
outputStream.flush();
How do I stream these PDFs instead of sending them as one big bunch?