How to save the pages of a PDF file into byte[] and restore them back - Java

I need to split a PDF file into pages and load each one separately into a byte[]. I use the iText library.
I receive a file consisting of one page with this code:
public Document addPageInTheDocument(String namePage, MultipartFile pdfData, Long documentId) throws IOException {
    notNull(namePage, INVALID_PARAMETRE);
    notNull(pdfData, INVALID_PARAMETRE);
    notNull(documentId, INVALID_PARAMETRE);
    byte[] in = pdfData.getBytes(); // size of file: 88747
    Page page = new Page(namePage);
    Document document = new Document();
    document.setId(documentId);
    PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfData.getBytes()));
    PdfDocument pdfDocument = new PdfDocument(reader);
    if (pdfDocument.getNumberOfPages() != 1) {
        throw new IllegalArgumentException();
    }
    byte[] transform = pdfDocument.getPage(1).getContentBytes(); // size of page: 1907
    page.setPageData(pdfDocument.getPage(1).getContentBytes());
    return addPageInTheDocument(document, page);
}
I'm trying to restore the file with this code:
ByteBuffer byteContent = new ByteBuffer();
for (Map.Entry<String, Page> page : pages.entrySet()) {
    byteContent.append(page.getValue().getPageData());
}
PdfWriter writer = new PdfWriter(new FileOutputStream(book.getName() + modification + FORMAT));
byte[] df = byteContent.toByteArray();
PdfReader reader = new PdfReader(new ByteArrayInputStream(byteContent.toByteArray()));
com.itextpdf.layout.Document itextDocument = new com.itextpdf.layout.Document(new PdfDocument(reader, writer));
itextDocument.close();
Why is there such a difference in size?
And how should the pages be stored as byte[] so that the file can be created from them again?

Let's start with your size question:
byte[] in = pdfData.getBytes(); // size of file: 88747
...
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // size of page: 1907
...
Why is there such a difference in size?
Because PdfPage.getContentBytes() does not return what you expect.
You seem to expect it to return a complete representation of the contents of the given page, and the Javadoc of that method ("Get decoded bytes for the whole page content.") might be interpreted to mean just that.
This is not the case. PdfPage.getContentBytes() returns the contents of the content stream(s) of the page. These content stream(s) contain a sequence of commands which build the page. But these commands take parameters which reference data outside the content stream, e.g.:
- when text is drawn on a PDF page, the content stream contains an operation selecting a font, but the data describing the font (and, in the case of embedded fonts, the font program itself) is outside the content stream;
- when bitmap images are drawn, the content stream usually contains an operation for it which references image data outside the content stream;
- there are operations which reference so-called xobjects, which essentially are independent content streams that can be called upon from any page; these xobjects are not contained in the page content stream either.
Furthermore there are annotations (e.g. form fields) with their own content streams which are stored in separate structures. And lots of page properties are outside, too.
Thus, there are such differences in size because you get only a minute part of the page definition using getContentBytes.
Now let's look at your code "restoring the file".
As a corollary of the above it is obvious that your code merely concatenates some content streams but does not provide the external resources these streams refer to.
But aside from that, your code also points out a misunderstanding concerning the nature of PDF pages: they are not merely blobs you can split and concatenate again as you want. They are collections of PDF objects which are spread throughout the PDF file; different pages can share some of their objects (e.g. fonts or often used images).
What you can do instead...
As representations of a single page you should use a PDF containing the data referenced by that single page. The iText example Burst.java shows how to do that.
To join these single page PDFs again you can use an iText PdfMerger. Remember to set smart mode (PdfWriter.setSmartMode(true)) to prevent resource duplication in the result.
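For illustration, here is a minimal sketch of this split-and-merge approach, assuming iText 7; the method name, file paths and the in-memory list are placeholders, not part of the original code:

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.utils.PdfMerger;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Split a PDF into self-contained single-page PDFs, then merge them again.
public static void splitAndMerge(String inputPath, String outputPath) throws IOException {
    List<byte[]> pagePdfs = new ArrayList<>();
    try (PdfDocument source = new PdfDocument(new PdfReader(inputPath))) {
        for (int p = 1; p <= source.getNumberOfPages(); p++) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (PdfDocument single = new PdfDocument(new PdfWriter(baos))) {
                // copyPagesTo copies the page together with every resource it references
                source.copyPagesTo(p, p, single);
            }
            pagePdfs.add(baos.toByteArray());
        }
    }

    PdfWriter writer = new PdfWriter(outputPath);
    writer.setSmartMode(true); // de-duplicates resources shared between the page PDFs
    try (PdfDocument target = new PdfDocument(writer)) {
        PdfMerger merger = new PdfMerger(target);
        for (byte[] onePage : pagePdfs) {
            try (PdfDocument single = new PdfDocument(new PdfReader(new ByteArrayInputStream(onePage)))) {
                merger.merge(single, 1, 1);
            }
        }
    }
}

Each single-page PDF produced this way is self-contained because copyPagesTo carries over every object the page references; smart mode then prevents those shared objects from being written once per page in the merged result.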

Related

GSON / iText: Extract Text From PDF 1.7 byte[]

I'm automating tests using Rest-Assured and GSON, and need to validate the contents of a PDF file that is returned in the response of a POST request. The content of the files varies and can contain anything from just text, to text and tables, or text, tables and graphics. Every page can, and most likely will, be different as far as glyph content goes. I am only concerned with ALL text on the PDF page, be it plain text, text inside a table, or text associated with (or inside of) an image. Since all PDFs returned by the request are different, I cannot define search areas (as far as I know). I just need to extract all text on the page.
I extract the pdf data into a byte array like so:
Gson pdfGson = new Gson();
byte[] pdfBytes = pdfGson.fromJson(this.response.as(JsonObject.class)
        .get("pdfData").getAsJsonObject().get("data").getAsJsonArray(), byte[].class);
(I've tried other extraction methods for the byte[], but this is the only way I've found that returns valid data.) This returns a very large byte[] like so:
[37, 91, 22, 45, 23, ...]
When I parse the array I run into the same issue as This Question (except my pdf is 1.7) and I attempt to implement the accepted answer, adjusted for my purposes and as explained in the documentation for iText:
byte[] decodedPdfBytes = PdfReader.decodeBytes(pdfBytes, new PdfDictionary(), FilterHandlers.getDefaultFilterHandlers());
IRandomAccessSource source = new RandomAccessSourceFactory().createSource(decodedPdfBytes);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ReaderProperties readerProperties = new ReaderProperties();
// Ineffective:
readerProperties.setPassword(user.password.getBytes());
PdfReader pdfReader = new PdfReader(source, readerProperties);
// Ineffective:
pdfReader.setUnethicalReading(true);
PdfDocument pdfDoc = new PdfDocument(pdfReader, new PdfWriter(baos));
for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) { // <= so the last page is not skipped
    String text = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i));
    System.out.println(text);
}
This DOES decode the PDF page and return text, but only the header text. No other text is returned.
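For reference, getTextFromPage also accepts an explicit extraction strategy; the default used above is LocationTextExtractionStrategy. A sketch like the following (an assumption worth testing, not a known fix for this case) would try SimpleTextExtractionStrategy instead, which can behave differently on documents with unusual content ordering:

import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;
import com.itextpdf.kernel.pdf.canvas.parser.listener.SimpleTextExtractionStrategy;

for (int i = 1; i <= pdfDoc.getNumberOfPages(); i++) {
    // Same page loop as above, but with an explicitly chosen strategy
    String text = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i), new SimpleTextExtractionStrategy());
    System.out.println(text);
}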
For what it's worth, on the front end, when the user clicks the button to generate the pdf, it returns a blob containing the download data, so I'm relatively sure that the metadata is GSA encoded, but I'm not sure if that matters at all. I'm not able to share an example of the pdf docs due to sensitive material.
Any pointer in the right direction would be greatly appreciated! I've spent 3 days trying to find a solution.
For those looking for a solution - ultimately we wound up going a different route. We never found a solution to this specific issue.

how to add metadata to an image (with java code) and then convert it to dicom

I found Java code that combines a JPG file and a DICOM file (it takes the metadata from the latter) into a final DICOM file. What I want to do is convert the JPG image into a DICOM one, generating the metadata with Java code.
BufferedImage jpg = ImageIO.read(new File("myjpg.jpg"));

// Convert the image to a byte array
DataBuffer buff = jpg.getData().getDataBuffer();
DataBufferUShort buffer = new DataBufferUShort(buff.getSize());
for (int i = 0; i < buffer.getSize(); ++i)
    buffer.setElem(i, buff.getElem(i));
short[] data = buffer.getData();
ByteBuffer byteBuf = ByteBuffer.allocate(2 * data.length);
int i = 0;
while (data.length > i) {
    byteBuf.putShort(data[i]);
    i++;
}

// Copy a header
DicomInputStream dis = new DicomInputStream(new File("fileToCopyheaderFrom.dcm"));
Attributes meta = dis.readFileMetaInformation();
Attributes attribs = dis.readDataset(-1, Tag.PixelData);
dis.close();

// Change the rows and columns
attribs.setInt(Tag.Rows, VR.US, jpg.getHeight());
attribs.setInt(Tag.Columns, VR.US, jpg.getWidth());
System.out.println(byteBuf.array().length);

// Write the file
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
DicomOutputStream dcmo = new DicomOutputStream(new File("myDicom.dcm"));
dcmo.writeFileMetaInformation(meta);
attribs.writeTo(dcmo);
dcmo.close();
I am not an expert in this toolkit (nor in Java, for that matter).
Your "// Copy a header" section reads the source DICOM file and holds all its attributes in the Attributes attribs variable.
Then, your "// Change the rows and columns" section modifies a few attributes as needed.
Then, your "// Write the file" section simply writes the attributes read from the source file to the destination file.
Now, you want to bypass the source DICOM file and convert a plain JPEG to DICOM, adding the attributes yourself.
Replace your "// Copy a header" section with code that builds the Attributes instance yourself:
Attributes attribs = new Attributes();
attribs.setString(Tag.StudyDate, VR.DA, "20110404");
attribs.setString(Tag.StudyTime, VR.TM, "15");
The tags mentioned above are examples only. You have to decide yourself which tags you want to include. Note that the specifications define Types 1, 1C, 2, 2C and 3 for tags, depending on the SOP class you are dealing with.
While adding the tags, you also have to take care to use the correct VR. The specifications cover that as well.
I cannot explain all this here; too broad.
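Still, a minimal sketch of that approach in dcm4che 3 terms may help; the tags and values below are assumptions for a 16-bit grayscale secondary-capture object (reusing jpg and byteBuf from the question's code), not a complete or validated tag set:

import org.dcm4che3.data.Attributes;
import org.dcm4che3.data.Tag;
import org.dcm4che3.data.UID;
import org.dcm4che3.data.VR;
import org.dcm4che3.io.DicomOutputStream;
import org.dcm4che3.util.UIDUtils;
import java.io.File;

Attributes attribs = new Attributes();
// Identity: every new object needs its own freshly generated UIDs
attribs.setString(Tag.SOPClassUID, VR.UI, UID.SecondaryCaptureImageStorage);
attribs.setString(Tag.SOPInstanceUID, VR.UI, UIDUtils.createUID());
attribs.setString(Tag.StudyInstanceUID, VR.UI, UIDUtils.createUID());
attribs.setString(Tag.SeriesInstanceUID, VR.UI, UIDUtils.createUID());
// Pixel module for a 16-bit grayscale image (adjust to your actual data)
attribs.setInt(Tag.Rows, VR.US, jpg.getHeight());
attribs.setInt(Tag.Columns, VR.US, jpg.getWidth());
attribs.setInt(Tag.SamplesPerPixel, VR.US, 1);
attribs.setString(Tag.PhotometricInterpretation, VR.CS, "MONOCHROME2");
attribs.setInt(Tag.BitsAllocated, VR.US, 16);
attribs.setInt(Tag.BitsStored, VR.US, 16);
attribs.setInt(Tag.HighBit, VR.US, 15);
attribs.setInt(Tag.PixelRepresentation, VR.US, 0);
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
// File meta information derived from the dataset
Attributes meta = attribs.createFileMetaInformation(UID.ExplicitVRLittleEndian);
try (DicomOutputStream dcmo = new DicomOutputStream(new File("myDicom.dcm"))) {
    dcmo.writeDataset(meta, attribs);
}

Which additional tags are mandatory still depends on the SOP class, as noted above.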
I cannot help with dcm4che, but if using another Java DICOM library is an option for you, this task is quite simple using DeCaMino (http://dicomplugin.com):
BufferedImage jpg = ImageIO.read(new File("myjpg.jpg"));
DicomWriter dw = new DicomWriter();
dw.setOutput(new File("myjpg.dcm"));
DicomMetadata dmd = new DicomMetadata();
dw.write(dmd, new IIOImage(jpg, null, null), null);
This will write a DICOM conform file with SOP class "secondary capture" and default metadata.
To customize the metadata, add data elements to dmd before writing, e.g.:
DataSet ds = dmd.getDataSet();
ds.set(Tag.StudyDate, LocalDate.of(2011, 4, 4));
ds.set(Tag.StudyTime, LocalTime.of(15, 0, 0));
You can also change the transfer syntax (thus controlling the pixel data encoding):
dw.setTransferSyntax(UID.JPEG2000TS);
Disclaimer: I'm the author of DeCaMino.
EDIT: As kritzel_sw says, I'd strongly advise against modifying an existing DICOM object by changing the pixel data and some data elements; you'll mostly end up with a non-conformant object. It is better to write an object from scratch, and the simplest objects are those of the secondary capture class. DeCaMino helps you by generating a conformant secondary capture object with mandatory data elements, but it won't help you generate a modality object (like a CT acquisition).
Just a side note:
attribs.setBytes(Tag.PixelData, VR.OW, byteBuf.array());
VR.OW means 16 bits per pixel/channel. Since you are replacing the pixel data with pixel data read from a JPEG image, and you named the buffer "byteBuf", I suspect that this is inconsistent. VR.OB is the value representation for 8-bits-per-pixel/channel images.
Talking about channels: I understand that you want to make the construction of a DICOM object easy by modifying an existing DICOM image rather than creating a new one from scratch. However, color pixel data is not appropriate for all types of DICOM images. E.g. if your fileToCopyheaderFrom.dcm is a radiography, CT or MRI image (or one of many other radiology types), it is not allowed to contain color pixel data.
Furthermore, each image contains identifying information (Study-, Series-, SOP Instance UID are the most important ones) which should be replaced by newly generated values.
I understand that it appears appealing to modify an existing DICOM object with new pixel data, but this process is probably much more complicated than you would expect it to be. In both cases, it is inevitable to learn basic DICOM concepts.
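As a concrete illustration of the OB/OW point above, a hypothetical 8-bit RGB secondary-capture pixel module in dcm4che 3 terms; rgbBytes is an assumed byte[] of interleaved R, G, B samples, and the identifying tags would still need to be set as described:

// Hypothetical 8-bit RGB pixel module: 8-bit data pairs with VR.OB, not VR.OW
attribs.setInt(Tag.SamplesPerPixel, VR.US, 3);
attribs.setString(Tag.PhotometricInterpretation, VR.CS, "RGB");
attribs.setInt(Tag.PlanarConfiguration, VR.US, 0); // interleaved samples
attribs.setInt(Tag.BitsAllocated, VR.US, 8);
attribs.setInt(Tag.BitsStored, VR.US, 8);
attribs.setInt(Tag.HighBit, VR.US, 7);
attribs.setBytes(Tag.PixelData, VR.OB, rgbBytes);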

Extracting font size and boldness from PDF with PDF2Dom

PDF2Dom (based on the PDFBox library) is capable of converting PDFs to HTML format, preserving characteristics such as font size and boldness. An example of this conversion is shown below:
private void generateHTMLFromPDF(String filename) {
    PDDocument pdf = PDDocument.load(new File(filename));
    Writer output = new PrintWriter("src/output/pdf.html", "utf-8");
    new PDFDomTree().writeText(pdf, output);
    output.close();
}
I'm trying to parse an existing PDF and extract these characteristics on a line-by-line basis, and I wonder if there are any existing methods within PDF2Dom/PDFBox to parse these right from the PDF?
Another method would be to just use the HTML output and proceed from there but it seems like an unnecessary detour.
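One way to get these characteristics directly from PDFBox (which PDF2Dom builds on) is to subclass PDFTextStripper and inspect each TextPosition. A sketch assuming PDFBox 2.x, with boldness inferred heuristically from the font name (input.pdf is a placeholder):

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.File;
import java.io.IOException;
import java.util.List;

public class FontInfoStripper extends PDFTextStripper {
    public FontInfoStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        for (TextPosition tp : textPositions) {
            String fontName = tp.getFont().getName();
            // Heuristic: embedded font names often carry a "Bold" marker
            boolean bold = fontName != null && fontName.toLowerCase().contains("bold");
            System.out.printf("%s size=%.1fpt bold=%b font=%s%n",
                    tp.getUnicode(), tp.getFontSizeInPt(), bold, fontName);
        }
        super.writeString(text, textPositions);
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument pdf = PDDocument.load(new File("input.pdf"))) {
            new FontInfoStripper().getText(pdf); // triggers the writeString callbacks
        }
    }
}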

How to create an input stream from a list of byte[] where each byte[] represent one page of a document (say pdf)

From my Java code I am calling a Document Service to get a document from it based on some input parameters. The mime type of the document could be PDF, text or image. The service response is in the form of List<PageContent>, where each PageContent has a byte[] which represents one page of the document. Now I want to create an input stream for the whole document.
So I thought of collating all the pages like the following:
List<PageContent> pages = ...; // Response from the service
ByteArrayOutputStream bos = new ByteArrayOutputStream();
for (PageContent page : pages) {
    byte[] data = page.getData();
    bos.write(data);
}
InputStream is = new ByteArrayInputStream(bos.toByteArray());
I still doubt whether this is the right way. Should the logic of collating all pages into a single input stream differ based on the mime type?
What kind of details do I need to get from the Document Service provider to help me do this?
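If the pages really are raw, concatenable bytes (true for plain text, but not for PDF pages, as discussed in the answer at the top of this page), a SequenceInputStream avoids copying everything into one buffer first. A sketch assuming the PageContent type from the question:

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Only valid when the byte[] chunks are genuinely concatenable (e.g. plain text)
List<InputStream> streams = pages.stream()
        .map(page -> (InputStream) new ByteArrayInputStream(page.getData()))
        .collect(Collectors.toList());
InputStream is = new SequenceInputStream(Collections.enumeration(streams));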

iText java: streaming of a concatenated pdf

So I am generating a 500+ page document by concatenating 250 individual PDFs of 2 pages each. Each PDF is a form with fields into which data is placed.
I would like to stream content to the browser two pages at a time.
Currently I am waiting for the entire document to be generated before sending anything to the browser and this is causing some timeout problems.
This is a simplified form of what I am doing:
PdfCopyFields pdfCopier = new PdfCopyFields(outputStream);
for (Receipt receipt : receipts) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    generator.generateSaleDocument(receipt.getId(), baos);
    pdfCopier.addDocument(new PdfReader(baos.toByteArray()));
}
pdfCopier.close();
outputStream.flush();
How do I stream these PDFs instead of sending them as one big bunch?
