PDF Compression - HTML to PDF (wkhtmltopdf) - java

Background
I'm working on a Scala/Java project where we convert individual HTML files to PDF files. We then merge the individual files into one larger complete PDF file.
For the converting we are using sPDF which is built on top of wkhtmltopdf. For the merging we use PDFMergerUtility.
The reasons for making individual files are a bit complicated, but it should be noted that we can't make one big PDF off the bat and have to make the individual files first.
The issue
Initially we had no problems with this approach. However, as the system has grown, so have the final PDF files. We went from files that were 2MB-3MB to files that are 20MB. I would like to know if there are any obvious compression methods or techniques we could use.
There is a lot of shared content across the individual files but since we're just merging them as isolated/independent files (as in, none of the content that is the same across the individual files is being reused to save space) it doesn't make a difference in bringing down the file size.
If I manually ZIP the final PDF file, it greatly reduces the file size, as obviously there is a lot of repeated content.
So one option might just be to zip the PDF after I've finished the merging, but I would prefer to compress it during the merging or conversion process.
Any ideas?

You could try Sejda to merge, it's Java, open source and based on a fork of PDFBox. It can generate PDF files using object streams (PDFBox currently doesn't support that) and, in case it doesn't reduce the size that much, you can try to pipe its 'compress' task which goes through the document removing unused resources and compressing images.
It's battle tested as the engine behind PDFsam, so if you want to give it a quick test and see the outcome, just download PDFsam and use the merge module with your files (with the compression flag on); the result is what Sejda will generate.
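The fallback mentioned in the question (compressing the finished PDF after merging) can be sketched with nothing but the JDK. This is a minimal illustration, not part of sPDF, PDFMergerUtility or Sejda; the class name `PdfGzip` and its methods are made up for the example, and it assumes the merged PDF fits in memory as a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip the bytes of an already-merged PDF.
public class PdfGzip {

    /** Compresses the given bytes with gzip (DEFLATE). */
    static byte[] gzip(byte[] pdfBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(pdfBytes);
        }
        return out.toByteArray();
    }

    /** Restores the original bytes from gzip output. */
    static byte[] gunzip(byte[] gzBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzBytes))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }
}
```

Note that the result is a .gz file, not a PDF: a viewer cannot open it until it has been decompressed again, which is why compressing inside the PDF (as Sejda does) is usually preferable.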

Related

Need help choosing a robust archive format

I'm working on a Java application that runs on Lubuntu on single-board computers and produces thousands of image files, which are then transferred over FTP. The transfer takes several times longer for multiple files than it does for a single file of the same total size, I'm assuming because the FTP client has to establish a new connection for every file. So I thought I'd have the application put the image files in a single archive file, but the problem with this is that sometimes the SBC won't shut down cleanly for various reasons, in which case the entire archive may be corrupted and all the images lost. Archiving the files afterwards is not a great option, basically because it takes a long time. An intermediate solution may be to create multiple midsize archives, but I'm not happy with it.
I wrote a simple unit test to experiment with ZipOutputStream, and if I cancel the test before it closes the stream, the resulting zip file gets corrupted, unsurprisingly. Could anyone suggest a different widely recognized archive format and/or implementation that might be more robust?
The tar format with the jtar implementation seems to work pretty well. If I cancel in the middle of writing, I can still open the archive, at least with 7zip, and even get the partially written last entry.
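Why a sequential, header-per-entry layout like tar's tolerates an interrupted write can be shown with a deliberately simplified container. The format below is hypothetical (it is not tar and not jtar's API): each entry is written as name length, name, data length, data, so a reader can recover every complete entry that precedes the truncation point.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical append-only container, for illustration only.
public class AppendOnlyArchive {

    /** Appends one self-describing entry: name length, name, data length, data. */
    static void writeEntry(DataOutputStream out, String name, byte[] data) throws IOException {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        out.writeInt(nameBytes.length);
        out.write(nameBytes);
        out.writeInt(data.length);
        out.write(data);
        out.flush();
    }

    /** Returns every complete entry and silently drops a truncated last one. */
    static Map<String, byte[]> readComplete(InputStream raw) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        DataInputStream in = new DataInputStream(raw);
        try {
            while (true) {
                byte[] name = new byte[in.readInt()];
                in.readFully(name);
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                entries.put(new String(name, StandardCharsets.UTF_8), data);
            }
        } catch (EOFException endOrTruncated) {
            // Either a clean end of file or an entry cut off mid-write.
        }
        return entries;
    }
}
```

A zip file, by contrast, keeps its central directory at the end, which is why an unclosed ZipOutputStream leaves a file that many tools refuse to open. The sketch does not address durability: flush() hands data to the operating system, but surviving a power loss would additionally need something like FileDescriptor.sync().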

Looking for an efficient file caching system

I'm currently developing an MMO which uses numerous sprites (image files), and I plan to store these files in a compressed state on the user's hard drive. I was wondering whether there is an existing implementation of an efficient, directory-based cache system that I can use to store these image files in different folders, which can then be compressed into either one file or multiple files. I was also researching LZ4 (de)compression, and I suppose that would be useful as well, but that does not solve the directory issue.
Thanks!
EDIT: For example, one file should hold numerous image files.
If something like this does not exist, what would the fastest way be to compress multiple image files into one file, and then decompress to load them into memory when the program starts?

TrueZip Random Access Functionality

I'm trying to understand how to randomly traverse a file or files in a .tar.gz using TrueZIP in a Java 6 environment (using the Files classes). I found examples that use Java 7's Path; however, I can't come up with an example of how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections in the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large), then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
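The header-to-header walk described above can be sketched with the JDK alone. This is a simplified reader, not TrueZIP's API: the class name `TarGzLister` is made up, and it assumes plain ustar-style headers (name in bytes 0-99, size as octal text in bytes 124-135), ignoring long-name extensions and other special cases.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

// Hypothetical minimal lister for .tar.gz streams.
public class TarGzLister {
    private static final int BLOCK = 512;

    /** Returns entry name -> size, reading the gzip stream strictly front to back. */
    static Map<String, Long> list(InputStream tarGz) throws IOException {
        Map<String, Long> entries = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new GZIPInputStream(tarGz))) {
            byte[] header = new byte[BLOCK];
            while (true) {
                try {
                    in.readFully(header);
                } catch (EOFException e) {
                    break; // stream ended without the usual trailing zero blocks
                }
                if (header[0] == 0) {
                    break; // an empty name is treated as the end-of-archive block
                }
                String name = text(header, 0, 100);
                long size = Long.parseLong(text(header, 124, 12).trim(), 8);
                entries.put(name, size);
                // Entry data is padded to a multiple of 512 bytes.
                skipFully(in, (size + BLOCK - 1) / BLOCK * BLOCK);
            }
        }
        return entries;
    }

    /** Reads a NUL-terminated ASCII field from the header. */
    private static String text(byte[] buf, int offset, int length) {
        int end = offset;
        while (end < offset + length && buf[end] != 0) {
            end++;
        }
        return new String(buf, offset, end - offset, StandardCharsets.US_ASCII);
    }

    /** "Skipping" still inflates every byte; gzip offers no shortcut. */
    private static void skipFully(InputStream in, long n) throws IOException {
        byte[] scratch = new byte[8192];
        while (n > 0) {
            int read = in.read(scratch, 0, (int) Math.min(scratch.length, n));
            if (read < 0) {
                throw new EOFException("truncated entry");
            }
            n -= read;
        }
    }
}
```

Even though no entry is written to disk, reaching the last header costs a full pass of decompression, which is the point made above.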
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
As previously, that should be fine -- I don't know TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment in zran describes how such tools work:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion, the complete file has to be processed once to generate the necessary index.
That is much faster than actually decompressing everything.
The index allows the file to be split into blocks that can each be decompressed without first decompressing the blocks before them. That is used to emulate random access.

How to get pixel rgb value in hadoop?

I have millions of images stored in HDFS on Hadoop. I want to build an index of these images. How do I get the pixel RGB values of these images? I am new to Hadoop, and the image format in Hadoop is different from the original image binary format. Another problem: should I use a SequenceFile in Hadoop to pack the enormous number of images into one big file for efficiency? Many thanks.
I could answer the problem partially.
Another problem is should I use the sequencefile in hadoop to pack the enormous images to a big file for efficiency?
It depends on the size of the individual files. If the individual files are really big, consolidating them might not help much; if they are small, consolidating them is more likely to pay off.
Check this query on SO for more details.
If you have the additional storage and efficiency is important to you, I would definitely go with a SequenceFile. Hadoop will handle splitting the file up for you. We ran into a case where we were extracting data from imagery files, similar to what you are doing. In our case we were extracting metadata for ingestion into a discovery system so that our imagery files could be searched outside of the cluster. Because efficiency was not a big deal for us, we just processed the files individually, making sure they were not splittable. This way the other system can reach back over HTTP to grab the source files.
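For the pixel part of the question, which the answers above do not cover, one possible approach is to decode the image bytes with the JDK's ImageIO. The sketch below assumes the encoded bytes of one image have already been read out of HDFS unchanged (for example inside a map task); the class name `PixelReader` is made up, and no Hadoop API is used or implied.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: decode image bytes and read one pixel.
public class PixelReader {

    /** Decodes encoded image bytes (JPEG, PNG, ...) and returns {r, g, b} at (x, y). */
    static int[] rgbAt(byte[] encodedImage, int x, int y) throws IOException {
        BufferedImage img = ImageIO.read(new ByteArrayInputStream(encodedImage));
        if (img == null) {
            throw new IOException("unsupported or corrupt image data");
        }
        int argb = img.getRGB(x, y); // packed as 0xAARRGGBB
        return new int[] { (argb >> 16) & 0xFF, (argb >> 8) & 0xFF, argb & 0xFF };
    }
}
```

HDFS itself stores file contents byte for byte, so an image file copied into it decodes the same way as on a local disk; only container formats such as SequenceFile wrap the bytes in records.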

Storing lots of small files: archive vs. filesystem

I am creating an application that requires a lot of image thumbnails (~3000, 5-25KB). Because speed is essential I plan on loading these images into memory when the application starts. At runtime, new thumbnails will be downloaded and added to the collective.
I could store them all in a folder, but reading thousands of files into memory when a program starts hardly seems efficient.
My second option would be to save them in some kind of (compressed) archive. This would make storage itself and loading more efficient (I think). However, new files will be added regularly, and that will probably not go as smoothly as just saving them in a folder.
Is storing a cache of small files in a (compressed) archive a bad idea or not? Are ZIP files the way to go? Would I be better off using uncompressed archives (and if so, what kind)?
All image files will be JPEG's.
Thanks in advance!
EDIT: I am considering dropping the "load everything into memory on application start" requirement. This would simplify my question a little. My initial idea of putting everything in one big file now seems less beneficial, since the problem of many files in one directory can be solved by hashing into subdirectories.
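Hashing into subdirectories, as mentioned in the edit, could look roughly like this. The sketch picks one directory level named after the first byte of the file name's SHA-1 hash, giving 256 buckets; the class name `ShardedPaths` and the choice of SHA-1 and of a single level are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: spread files over 256 subdirectories.
public class ShardedPaths {

    /** Maps a file name to root/xx/name, where xx is the first byte of its SHA-1 in hex. */
    static Path pathFor(Path root, String fileName) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(fileName.getBytes(StandardCharsets.UTF_8));
            String bucket = String.format("%02x", digest[0] & 0xFF);
            return root.resolve(bucket).resolve(fileName);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

With about 3000 thumbnails this leaves roughly a dozen files per directory; a second level (using the next digest byte) would only be needed for much larger collections.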
Small files don't compress especially well, so you may not gain much compression.
While loading the files will be fast because they are smaller, decompression adds time. You'd have to experiment to see which is faster.
I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.
I would consider doing something like writing them out into one file, uncompressed, that could be streamed into memory -- maybe not necessarily contiguous memory, as that might be a problem. But the idea would be to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset from which the location of the image in memory could be determined.
New images could be added at the end, and the index updated appropriately.
It isn't fancy but that's what you're trying to avoid. An archive or even a file system gives you lots of power and flexibility but at the cost of efficiency. When you know what you want to do, sometimes simple is better.
I would consider implementing a solution that reads files from a folder, another that divides the files into subfolders and subsubfolders so there are no more than 100 or so files in any given folder, then time those solutions so you have something to compare to. I would think a simple indexed file would be fast enough that you wouldn't even need to pre-load the images like you're suggesting -- just retrieve them as you need them and keep them around once they're in memory.
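The single-file-plus-index idea from this answer can be sketched as follows. The class name `ThumbnailStore` is made up, the index lives only in memory (persisting and reloading it is left out), and the code is not thread-safe; it is a starting point for the timing comparison suggested above rather than a finished cache.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical store: all images in one uncompressed file, located via an index.
public class ThumbnailStore implements AutoCloseable {
    private final RandomAccessFile file;
    // name -> {offset, length}
    private final Map<String, long[]> index = new HashMap<>();

    ThumbnailStore(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
    }

    /** Appends the image at the end of the file and records where it landed. */
    void put(String name, byte[] jpeg) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.write(jpeg);
        index.put(name, new long[] { offset, jpeg.length });
    }

    /** Seeks straight to the stored bytes; returns null for unknown names. */
    byte[] get(String name) throws IOException {
        long[] location = index.get(name);
        if (location == null) {
            return null;
        }
        byte[] data = new byte[(int) location[1]];
        file.seek(location[0]);
        file.readFully(data);
        return data;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

Because new images are only ever appended, existing offsets never change, which keeps the index update trivial.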
All disk-based storage, and most databases, allocate space in chunks. The chunks on large-capacity disks can be large. If you have 5 KB files and a 32 KB disk chunk, each file still occupies a full chunk, so roughly 84% of the allocated space (27 KB out of every 32 KB) is wasted.
Using an archive won't compress jpeg much because the jpeg encoding algorithm already does that. It will however save you the wasted space on the storage media. It does make things more complicated and perhaps a little slower.
In my opinion the zip file approach is a bad idea, because you will slow everything down by having to load the zip file and unzip it to extract each image.
The point of a thumbnail is that it is small by nature, so your app and hardware can load it as fast as possible. So I believe it is a better idea to load each image as you need it.
Well, if you have small, "geometric" pictures, you may implement them as objects of type javax.swing.Icon rather than images to load from the filesystem.
http://download.oracle.com/javase/6/docs/api/javax/swing/Icon.html
http://download.oracle.com/javase/tutorial/uiswing/components/icon.html
So you will implement one or more objects which draw themselves onto a Graphics surface using the Graphics drawing primitives, instead of copying pixels.
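A minimal example of such a self-drawing icon is shown below. The class `DotIcon` (a filled circle of a given size and color) is invented for illustration; only the three methods of the javax.swing.Icon interface come from the JDK.

```java
import java.awt.Color;
import java.awt.Component;
import java.awt.Graphics;
import javax.swing.Icon;

// Hypothetical icon that paints a filled circle instead of loading pixels from disk.
public class DotIcon implements Icon {
    private final int size;
    private final Color color;

    DotIcon(int size, Color color) {
        this.size = size;
        this.color = color;
    }

    @Override
    public int getIconWidth() {
        return size;
    }

    @Override
    public int getIconHeight() {
        return size;
    }

    @Override
    public void paintIcon(Component c, Graphics g, int x, int y) {
        Color previous = g.getColor();
        g.setColor(color);
        g.fillOval(x, y, size, size);
        g.setColor(previous); // leave the Graphics state as we found it
    }
}
```

An instance can be passed anywhere Swing accepts an Icon, for example `new JLabel(new DotIcon(16, Color.RED))`. This only helps for simple geometric pictures; it is no substitute for photographic JPEG thumbnails.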
If this is a web application, then the best performance boost you can get is setting good HTTP caching headers. Having a unique URL for every image (and different URLs for different versions of the same image) makes it possible to set VERY far-future Expires headers, because changing the image changes the URL, which forces a refetch.
I wouldn't compress the images, because JPEG does not compress well and compression only costs CPU time.
I would recommend simply storing the images in the filesystem and considering a library like jawr, or implementing your own caching strategy.
I know this question has already been answered, but I think you need more options than zipping.
While zip is good, it doesn't do much for JPEG files, since JPEG data is already compressed.
Other things you may want to consider:
Put the images on a Content Delivery Network (CDN).
Compress components with gzip (meaning the server automatically compresses every response). You don't need to write any code to decompress it later; the browser handles that automatically.
Since you mention JPEG, you may want to use jpegtran. Run jpegtran on all your JPEGs.
This tool does lossless JPEG operations such as rotation, and can also be used to optimize your images and remove comments and other unneeded information (such as EXIF data) from them.
jpegtran -copy none -optimize -perfect -outfile dest.jpg src.jpg
Use image sprites. Instead of asking the browser to download many images at the same time, ask it to download only one.
For details, read: http://developer.yahoo.com/performance/rules.html#opt_images
For a basic check of how to improve your website's performance, you can try installing YSlow (a Firefox plugin that detects inefficient pages).
Hope that helps.
