Inline Image vs Temporary Files (Java XHTML->PDF generation) - java

I have a project where I need to generate a PDF file. Within this PDF I have to insert a body of text as well as four or five large images (roughly 800px*1000px). In order to make this flexible I have opted to use FreeMarker in conjunction with XHTMLRenderer (flying-saucer).
I am now faced with a couple of options:
Create the images and save them as temporary files to disk. Then process an .xhtml template with FreeMarker (saving it to disk) and pass the processed .xhtml file URL to XHTMLRenderer to generate the PDF. All these created files (bar the PDF) would be created with File.createTempFile. This would allow FreeMarker to pick the images up off the disk (as if they were images linked in the XHTML)
Process the .xhtml template and keep it in memory. Pass the images to the template as base64 encoded data urls. This would remove the need for saving any temporary files as the output from FreeMarker could be passed directly to XHTMLRenderer.
Base64 Encoded Image Url example (a small folder icon):
<img src="
/ge8WSLf/rhf/3kdbW1mxsbP//mf///yH5BAAAAAAALAAAAAAQAA4AAARe8L1Ekyky67QZ1hLnjM5UUde0ECwLJoExK
cppV0aCcGCmTIHEIUEqjgaORCMxIC6e0CcguWw6aFjsVMkkIr7g77ZKPJjPZqIyd7sJAgVGoEGv2xsBxqNgYPj/gAwXEQA7" />
My main question is which would be a better technique? Is creating lots of temporary files bad (does it carry lots of overhead)? Could I potentially run out of memory creating such large base64 encoded strings?

I found myself asking the same question recently. After some benchmarking, it turns out the data URI approach was the best bet.
Storing a bunch of Base64-encoded images can be expensive. But the overhead for creating temp files, streaming image data in, then waiting for XHTMLRenderer hit that temp file 4 times before cleaning it up is also taxing.
In my experiments, the Base64 images proved to be a better approach. That being said, I'm not sure to what extent it will remain true for larger images. In my case, I was testing with 32x32 icons, 80x80 logos, 400x240 bar graphs and one 600x400 graphic. The difference in overhead was significant with everything except the 600x400 graphic, where it got really negligible.
(A side note for Joop Eggen- In my case, PDF generation is time critical. The user clicks a button the PDF and expects the download to begin immediately.)

PDF generation is not time critical - one might even considering throtling the communication. Embedding images in Base64 costs a bit more CPU and memory in an already costly templating transformation: the Base64 buld data is dragged through the templating pipeling, then probably decoded from Base64 to binary to be compressed. I even was unaware that embedded images are possible. So the overhead of temp files is a more sure solution. Certainly to start with. Of course one can benchmark both cases.

Related

PDF font subsetting and subset merging in Java

I have a part in my code where I am programatically filling out PDF forms using iText Java based on user-entered data, and then I concat a number of such PDFs into one using iText again.
The PDF forms that are getting merged can be (and usually are) different.
The resulting PDF is way too large - looking at it, 98% of the space is taken by fonts.
The way I understand it, what happens is that the individual PDF forms have different font subsets, so when I merge them, I get massive amount of duplicate glyphs, except that the subsets are not identical, so I can't get rid of them without merging the subsets.
The other problem is that the PDF forms themselves might not even contain subsets, but heavily packed fonts that have 2000+ glyphs, so even if I manage to leave only one instance of that font in the PDF, that still can be many megabytes. Hence it seems that I need to be able to 1) create and 2) merge existing font subsets.
The quirk is that I do not control neither the PDF forms (that are being filled out) nor their number, nor the order in which they are concatenated, so it is not possible to solve this by controlling what kind of fonts are embedded in them.
Adobe Acrobat can of course solve such a problem - it can create and also merge font subsets - but I need a programatic, server-side solution. According to google hits, iText cannot do this. Is there another library that I could use (or anything else I can do)?

Dynamic Image Caching with Java

I have a servlet with an API that delivers images from GET requests. The servlet creates a data file of CAD commands based on the parameters of the GET request. This data file is then delivered to an image parser, which creates an image on the file system. The servlet reads the image and returns the bytes on the response.
All of the IO and the calling of the image parser program can be very taxing and images of around 80kb are rendering in 3-4000ms on a local system.
There are roughly 20 parameters that make up the GET request. Each correlates to a different portion of the image. So, the combinations of possible images is extremely large.
To alleviate the loading time, I plan to store BLOBs of rendered images in a database. If a GET request matches one previously executed, I will pull from cache. Else, I will render a new one. This does not fix "first-time" run, but will help "n+1 runs".
Any other ideas on how I can improve performance?
you can store file on you disk,and image path in database,because database storage is usually more expensive than file system storage.
sort the http get parameters and hash them as an index to that image record for fast query by parameters.
to make sure your program not crush when disk capacity not enough,you should remove the the unused or rarely used record:
store a lastAccessedTime for each record,updated each time when the image is requested.
using a scheduler to check lastAccessedTime,removing records which is lower than a specified weight.
you can use different strategy to calculate the weight,such as lastAccessedTime,accessedCount,image size,etc.
You can turn all the parameters that you feed into the rendering pipeline into a single String in a predictable way such that you can compute a SHA1 hash of the input then store the output file in a directory with the SHA1 as the file name, that way if you get a request with the same parameters you just compute the hash then check if the file is on disk if it is return it otherwise send the work to the render pipeline and create the file.
If you have a lot of files you might want to use more than one directory, maybe look at how git divides up files across directories by the first few chars of the SHA1 for inspiration.
I use a similar setup on my app I am not doing rendering just storing files, the files are stored in the db but for performance reasons I serve them out from disk using the sha1 hash of the file contents as the filename / URI for the file.

Custom Binary Input - Hadoop

I am developing a demo application in Hadoop and my input is .mrc image files. I want to load them to hadoop and do some image processing over them.
These are binary files that contain a large header with metadata followed by the data of a set of images. The information on how to read the images is also contained in the header (eg. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel, so after the header bytes, the first [number_of_pixels_x*number_of_pixels_y*bytes_per_pixel] are the first image, then the second and so on].
What is a good Input format for these kinds of files? I thought two possible solutions:
Convert them to sequence files by placing the metadata in the sequence file header and have pairs for each image. In this case can I access the metadata from all mappers?
Write a custom InputFormat and RecordReader and create splits for each image while placing the metadata in distributed cache.
I am new in Hadoop, so I may be missing something. Which approach you think is better? is any other way that I am missing?
Without knowing your file formats, the first option seems to be the better option. Using sequence files you can leverage a lot of SequenceFile related tools to get better performance. However, there are two things that do concern me with this approach.
How will you get your .mrc files into a .seq format?
You mentioned that the header is large, this may reduce the performance of SequenceFiles
But even with those concerns, I think that representing your data in SequenceFile's is the best option.

How to Combine Images without loading them into RAM in Java

I have a very large (around a gigapixel) image I am trying to generate, and so far I can only create images up to around 40 megapixels in a BufferedImage before I get an out of memory error. I want to construct the image piece by piece, then combine the pieces without loading the images into memory. I could also do this by writing each piece to a file, but ImageIO does not support this.
I think JAI can help you build what you want. I would suggest looking at the data structures and streams offered by JAI.
Also, have a look at these questions, might help you with ideas.
How to save a large fractal image with the least possible memory footprint
How to create a big image file from many tiles
Appending to an Image File
You basically want to reverse 2 there.
Good luck with your project ;)
Not a proper solution, just a sketch.
Unpacking a piece of image is not easy when an image is compressed. You can decompress, by an external tool, the image into some trivial format (xpm, uncompressed tiff). Then you could load pieces of this image as byte arrays, because the format is so straightforward, and create Image instances out of these raw data.
I see two easy solutions. Create a custom binary format for your image. For saving, just generate one part at a time, seek() to the appropriate spot in the file, then offload your data. For loading, seek() to the appropriate spot in the file, then load your data.
The other solution is to learn an image format yourself. bmp is uncompressed, but the only easy one to learn. Once learned, the above steps work quite well.
Remember to convert your image to a byte array for easy storage.
If there is no way to do it built into Java (for your sake I hope this is not the case and that someone answers saying so), then you will need to implement an algorithm yourself, just as others have commented here saying so.
You do not necessarily need to understand the entire algorithm yourself. If you take a pre-existing algorithm, you could just modify it to load the file as a byte stream, create a byte buffer to keep reading chunks of the file, and modify the algorithm to accept this data a chunk at a time.
Some algorithms, such as jpg, might not be possible to implement with a linear stream of file chunks in this manner. As #warren suggested, bmp is probably the easiest to implement in this way since that file format just has a header of so many bytes then it just dumps the RGBA data straight out in binary format (along with some padding). So if you were to load up your sub-images that need to be combined, loading them logically 1 at a time (though you could actually multithread this thing and load the next data concurrently to speed it up, as this process is going to take a long time), reading the next line of data, saving that out to your binary output stream, and so on.
You might even need to load the sub-images multiple times. For example, imagine an image being saved which is made up of 4 sub-images in a 2x2 grid. You might need to load image 1, read its first line of data, save that to your new file, release image 1, load image 2, read its first line of data, save, release 2, load 1 to read its 2nd line of data, and so on. You would be more likely to need to do this if you use a compressed image format for saving in.
To suggest a bmp again, since bmp is not compressed and you can just save the data in whatever format you want (assuming the file was opened in a manner which provides random access), you could skip around in the file you're saving so that you can completely read 1 sub-image and save all of its data before moving on to the next one. That might provide run time savings, but it might also provide terrible saved file sizes.
And I could go on. There are likely to be multiple pitfalls, optimizations, and so on.
Instead of saving 1 huge file which is the result of combining other files, what if you created a new image file format which was merely made up of meta-data allowing it to reference other files in a way which combined them logically without actually creating 1 massive file? Whether or not creating a new image file format is an option depends on your software; if you are expecting people to take these images to use in other software, then this would not work - at least, not unless you could get your new image file format to catch on and become standard.

How to display an image which is in bytes to JSP page using HTML tags?

I have ByteArrayOutputStream which contains a JPEG image in bytes. My requirement is to display that image in a JSP page (to display the image in the frontend using HTML tags). How do I do that?
I have referred the BufferedImage class, but it is confusing for me because I am new to this.
If the image is not too big, you can do it as follows.
<img src="..." />
Where iVBORw0KGgoAAAANS... is the Base64 encoded bytes.
Base64 encoding can be done with a library, like Ostermiller Java Utilities' Base64 Java Library or org.apache.commons.codec.binary.Base64.
Unless you use a "data" URI (useful for small images), the browser will make two requests: one for the HTML and one for the image. You need to be able to output an img tag which includes enough information to let you respond to the subsequent request for the image with the data in your ByteArrayOutputStream.
Depending on how you got that JPEG file and how your server scales out, that might involve writing the image to disk, caching it in memory, regenerating it, or any combination of these.
If you can delay the image generation until the browser requests the actual image in the first place, that's pretty ideal. That may involve putting extra parameters in the URL for the image - such as points on a graph, or the size of thumbnail to generate, or whatever your image is.
If you're new to both JSP and HTML, I strongly recommend you concentrate on the HTML side first. Work out what you need to serve and what the browser will do before you work out how to serve it dynamically. Start with static pages and files for the HTML and images, and then work out how to generate them instead.

Categories