How to reduce size of RTF with embedded images?

We have some code which produces an RTF document from an RTF template. It basically does a string search-and-replace of special tags within the RTF file. This is accessible via a web page.
Typically, the processing time for this is really quick.
However, we need to embed an image within a template. We've been embedding these as JPEG images using Word's "Insert/Picture/From File..." functionality, but we've found that the resulting RTF file size is massively dependent upon the image.
For example, I've inserted a 20k JPEG logo (which is basically a solid background with some text). The RTF file increased in size from around 390k (without the image) to 510k (with the image).
Then we inserted a JPEG containing a screenshot, i.e. the image contains text, multiple colours, etc. The JPEG is around 150k. Using this image, the RTF file increased in size from 390k to 3.5MB.
So the encoding that Word uses for storing images in an RTF doesn't scale linearly with the size of the JPEG. I'm guessing it depends on what is in the JPEG image.
I need to keep the size of the RTF templates to a minimum to try and keep our file processing times to a minimum.
Does anyone have any ideas on how to minimize the size of the RTF files with embedded images?
Is there any way of controlling the encoding that Word uses? I can't see any options anywhere.
Does anyone know what type of binary encoding Word/RTF uses?
Thanks in advance.

Here is the best solution
http://support.microsoft.com/kb/224663
Excerpt:
SYMPTOMS
When you save a Microsoft Word document that contains an EMF,
PNG, GIF, or JPEG graphic as a different file format (for example,
Word 6.0/95 (.doc) or Rich Text Format (.rtf)), the file size of the
document may dramatically increase.
For example, a Microsoft Word 2000 document that contains a JPEG
graphic that is saved as a Word 2000 document may have a file size of
45,568 bytes (44.5KB). However, when you save this file as Word 6.0/95
(.doc) or as Rich Text Format (.rtf), the file size may grow to
1,289,728 bytes (1.22MB).
CAUSE
This functionality is by design in Microsoft Word. If an
EMF, a PNG, a GIF, or a JPEG graphic is inserted into a Word document,
when the document is saved, two copies of the graphic are saved in the
document. Graphics are saved in the applicable EMF, PNG, GIF, or JPEG
format and are also converted to WMF (Windows Metafile) format.
RESOLUTION
Warning: If you use
Registry Editor incorrectly, you may cause serious problems that may
require you to reinstall your operating system. Microsoft cannot
guarantee that you can solve problems that result from using Registry
Editor incorrectly. Use Registry Editor at your own risk.
To prevent Word from saving two copies of the graphic in the document,
and to reduce the file size of the document, add the
ExportPictureWithMetafile=0 string value to the Microsoft Windows
registry.

An image in an RTF file gets stored as a WMF, uncompressed. On a Mac, it would be macpict. Your best bet to keep the file size down is to link the image to the document rather than insert a copy in the document. The trade-off is that you have to keep the files together.
EDIT
Is compressing the RTF an option? Using zip/rar, you'll get your file size back, but obviously you'll have to uncompress it first. There are supposed to be tools that will do RTF compression, but I have never used them.
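If compression is acceptable, it can be done from Java with the standard library. The following is only a sketch under a few assumptions: it uses gzip (java.util.zip) rather than zip/rar, it works on an RTF that is already in memory as a byte array, and the class name RtfGzip is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip an RTF held in memory and restore it again.
final class RtfGzip {

    // Compresses the raw RTF bytes with gzip.
    static byte[] compress(byte[] rtf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(rtf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Restores the original RTF bytes from the gzip data.
    static byte[] decompress(byte[] gzipped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

The hex-encoded picture data in an RTF uses only 16 distinct characters, so it tends to compress well, but the file has to be decompressed again before Word or the template code can read it.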

We have done a similar project over at work. Only we're not using that "Insert/Picture/From File..." functionality. Our template has a tag named [photos], as I presume your own does also. When we process the document we replace the tag with the RTF codes needed to display images. We're putting them within a table and we're displaying two images on each row, plus a row on top for the title.
So, you might place a tag [photos] in your template. Then you replace the tag with the RTF codes. You can find some good references to these codes on the web, e.g. here.
Now, my code looks something like this:
\par {\rtf1\ansi\deff0{\trowd\cellx8810 {title}\intbl\qc\cell\row}{\trowd\cellx4405\cellx8810{\pict\jpegblip\picwgoal4000\pichgoal3000\piccropl-50\piccropr-50\piccropt-50\piccropb-50\hex
Your image as an array of bytes in hexadecimal }\intbl\cell{\pict\jpegblip\picwgoal4000\pichgoal3000\piccropl-50\piccropr-50\piccropt-50\piccropb-50\hex
Your other image }\intbl\cell\row}
If you get your image into a byte array, you can use BitConverter.ToString(array) (a .NET method) to get the hex string; you then only need to replace the dashes "-" with "".
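Since the question is tagged Java, here is how the same conversion might look there. This is a minimal sketch, not the code we actually run: the class and method names (RtfPicture, toHex, jpegPict) are made up for illustration, it assumes the image is already a JPEG in a byte array, and it leaves out the cropping control words from the snippet above. The \picwgoal and \pichgoal values are in twips.

```java
// Hypothetical helper for building an RTF picture group from JPEG bytes.
final class RtfPicture {

    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Encodes every byte as two lowercase hex characters, with no separators.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    // Builds a {\pict ...} group that stores the JPEG bytes once, as hex.
    static String jpegPict(byte[] jpeg, int widthTwips, int heightTwips) {
        return "{\\pict\\jpegblip"
                + "\\picwgoal" + widthTwips
                + "\\pichgoal" + heightTwips
                + " " + toHex(jpeg) + "}";
    }
}
```

The result can then be swapped in for the tag, for example template.replace("[photos]", RtfPicture.jpegPict(bytes, 4000, 3000)).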
Our files take up less than 1/10th of the space a "normal" RTF does. If we open the document's code with an editor such as Notepad++, we can see the RTF codes, but if we open the document and save it as RTF (under a different name), it goes from 1.5 MB to 50 MB!
I'm guessing DaveParillo's reply justifies it: I'm only writing each image once.
Hope it helps.
Cheers mate

First, keep in mind that each byte of the image is stored as 2 hexadecimal characters (two bytes), so the image adds at least twice its original size to the RTF.
The other thing you need to know is that Word and WordPad insert different flavors (formats) of the same image, plus other fields that the RTF can be displayed without.
Here are some scripts used to insert images in RTF (https://joseluisbz.wordpress.com/2011/06/22/script-de-clases-rtf-para-jsp-y-php/), and one example of use (https://joseluisbz.wordpress.com/2011/07/16/subiendo-imagenes-png-y-jpg-y-archivos-a-mysql-con-php-y-jsp-y-mostrarlos-en-rtf-usando-clases/)
Now, maybe you will need to replace the original image with another one (http://joseluisbz.wordpress.com/2013/07/26/exploring-a-wmf-file-0x000900/).

Swartbees's answer worked perfectly for me. I first reduced the image quality to "0" using GIMP's "Save as JPEG" functionality. After applying the Microsoft solution suggested by Swartbees above, I reinserted the picture into the file and the size increase was negligible: 229k to 279k (as opposed to 29,000k).
Thanks for your suggestions guys.

Yes, by removing the redundant characters. To read the file again, you must then insert them back into your stream.
For instance, if you have more than twenty f characters in a row, you can replace them with f[20] in your stream. It is a start.
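A rough Java sketch of that idea follows. It is an illustration only: the class name HexRunLength, the minimum run length of 5 and the c[n] notation are assumptions, and it only works on the hex picture data (which never contains '[' or ']'). The packed text is no longer valid RTF until decode has expanded it again.

```java
// Hypothetical run-length packer for hex picture data.
final class HexRunLength {

    // Runs shorter than this are copied as-is, since "f[5]" is no shorter than "ffff".
    private static final int MIN_RUN = 5;

    // Replaces each long run of one character with c[n], e.g. 20 f's become "f[20]".
    static String encode(String hex) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < hex.length()) {
            char c = hex.charAt(i);
            int j = i;
            while (j < hex.length() && hex.charAt(j) == c) {
                j++;
            }
            int run = j - i;
            if (run >= MIN_RUN) {
                out.append(c).append('[').append(run).append(']');
            } else {
                for (int k = 0; k < run; k++) {
                    out.append(c);
                }
            }
            i = j;
        }
        return out.toString();
    }

    // Expands every c[n] back into n copies of c.
    static String decode(String packed) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < packed.length()) {
            char c = packed.charAt(i++);
            if (i < packed.length() && packed.charAt(i) == '[') {
                int close = packed.indexOf(']', i);
                int run = Integer.parseInt(packed.substring(i + 1, close));
                for (int k = 0; k < run; k++) {
                    out.append(c);
                }
                i = close + 1;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```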
-Best of luck.

Related

IText7: Creating Highlight for an PdfImageXObject

We converted our documents from TIFF to PDF, which created PdfImageXObjects within the resulting PDFs. The users mark up the PDFs using software provided by our document management system client. The markups are stored outside the PDF itself, so they are really just screen overlays. My problem is that when I attempt to merge the markups with the PDFs, the highlights are dull, since they must be transparent to show the text underneath. One option is to extract the PdfImageXObjects as bitmaps, convert the b/w images to RGB, and perform a bitwise operation that replaces the white space but not the text, which is a time-consuming prospect. Is there a way to do this that is less time- and resource-intensive?

How to identify the text in image files, and how to read that text?

The image is full of text, i.e. it is a scanned document stored as an image file (*.tiff). Optical character recognition only handles the normal form of the alphabet, but this image contains running-letter (cursive) text. How can I identify that text and convert it into text files?

With tesseract-ocr you can train the engine on your own characters. If you are sure about the running-letter font, you can use samples of it as the training data instead of the default data that ships with the library. I haven't done this with running letters, but this library is a good place to start.
http://code.google.com/p/tesseract-ocr/
Regards,
Prasanna.

Text extraction from a PDF file with fonts of different sizes and colors, using PDFBox - Java

I'm developing a Java application that extracts text from a PDF file, using the PDFBox library. I can extract most of the text from the file, but I cannot extract certain words that have a larger size and a different color.
I have tried the setAverageCharTolerance method of the PDFTextStripper class, but it made no difference.
Does anyone know of a way to extract all the text from the PDF file, whether it is large or small or in a different color?

Is it possible to include extra information in an image file that won't corrupt it?

I'm making a Java program that I'd like to be able to add extra information to an image file (just plain text that could later be read by the same program). I was considering adding it as the first line.
Is there a way to add information like this without corrupting the image file? Can I somehow "comment out" that text so it wouldn't be read as part of the image binary (so that when you open the image in Notepad, it would show up there)?

This is tagged as Java, but in general, you can modify the EXIF data of an image file.
See also: http://frickelblog.wordpress.com/2009/08/21/java-library-for-reading-writing-exif-xmp-and-iptc-in-jpegs/

Most image file formats allow arbitrary metadata. PNG images contain pixel data, color profiles, copyright notices, etc. in "chunks"; see http://www.libpng.org/pub/png/spec/1.2/PNG-Chunks.html for descriptions. You probably want an iTXt or tEXt chunk.
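To make the chunk idea concrete, here is a rough Java sketch that splices a tEXt chunk into existing PNG bytes and reads it back. It is only an illustration under stated assumptions: the class name PngTextChunk and its methods are made up, the whole file is assumed to be in memory, and the new chunk is placed directly after the IHDR chunk, which is always the first chunk and has a fixed size. Each chunk is a 4-byte big-endian length, a 4-byte type, the data, and a CRC-32 over type and data; tEXt data is a Latin-1 keyword, a zero byte, then the text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper that adds and reads PNG tEXt chunks on raw file bytes.
final class PngTextChunk {

    // 8-byte signature + IHDR chunk (4 length + 4 type + 13 data + 4 CRC).
    private static final int AFTER_IHDR = 8 + 4 + 4 + 13 + 4;

    // Returns a copy of the PNG with a tEXt chunk inserted right after IHDR.
    static byte[] addText(byte[] png, String keyword, String text) {
        if (png.length < AFTER_IHDR) {
            throw new IllegalArgumentException("too short to be a PNG");
        }
        byte[] type = "tEXt".getBytes(StandardCharsets.ISO_8859_1);
        byte[] key = keyword.getBytes(StandardCharsets.ISO_8859_1);
        byte[] val = text.getBytes(StandardCharsets.ISO_8859_1);
        byte[] data = new byte[key.length + 1 + val.length];
        System.arraycopy(key, 0, data, 0, key.length);
        // data[key.length] stays 0: the separator between keyword and text.
        System.arraycopy(val, 0, data, key.length + 1, val.length);

        CRC32 crc = new CRC32();
        crc.update(type);
        crc.update(data);

        ByteBuffer out = ByteBuffer.allocate(png.length + 12 + data.length);
        out.put(png, 0, AFTER_IHDR);
        out.putInt(data.length).put(type).put(data).putInt((int) crc.getValue());
        out.put(png, AFTER_IHDR, png.length - AFTER_IHDR);
        return out.array();
    }

    // Walks the chunk list and returns the text stored under keyword, or null.
    static String readText(byte[] png, String keyword) {
        ByteBuffer in = ByteBuffer.wrap(png);
        in.position(8); // skip the PNG signature
        while (in.remaining() >= 12) {
            int length = in.getInt();
            byte[] type = new byte[4];
            in.get(type);
            byte[] data = new byte[length];
            in.get(data);
            in.getInt(); // CRC, not verified in this sketch
            if ("tEXt".equals(new String(type, StandardCharsets.ISO_8859_1))) {
                String s = new String(data, StandardCharsets.ISO_8859_1);
                int sep = s.indexOf('\0');
                if (sep >= 0 && s.substring(0, sep).equals(keyword)) {
                    return s.substring(sep + 1);
                }
            }
        }
        return null;
    }
}
```

Image viewers skip ancillary chunks they do not need, so the picture itself still displays as before.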
JPEG also allows metadata. Read up on JFIF and EXIF. See this answer: https://stackoverflow.com/a/10699613/10468
See the question Writing image metadata in Java, preferably PNG for some Java-specific answers.
There might also be useful information for you in Find JPEG resolution with PHP although that question asks about PHP not Java.
TIFF isn't seen as much these days, but that format allows for a wide variety of extra data and user-defined tags for any data you like to store.
There is no image file format I know of that you can open in a text editor or print at the command line and that will show its metadata as text rather than gibberish from the pixels. However, it is possible to define such a file format: if the file starts off in text, has an EOT (ASCII 04, iirc) and then continues in binary, most text command line tools and possibly text editors will deal with the text fine and stop at the EOT. I'm not sure that works on all platforms.

You can use image steganography to hide text. Here are some links that can help you do this: Images' Steganography
Steganography - Hiding messages in the Noise of a Picture

Yes, it is possible. Here is the algorithm:
You need to make a mask that sets the least significant bit of each pixel value of your image to zero. Do it first with one bit and see the impact on your image. Then set the last two bits to zero and see how your image is affected. Continue this to figure out how many bits you can set to zero without any visible impact on the quality of your image.
Get the binary content of your text message.
Calculate the length of the binary content of your text message. You need this to save this value in the first few bytes of your image. You will later use it to know how many pixels you need to read to retrieve your message.
Here is the main part:
Let's say you are zeroing only the last bit of each pixel, and you reserve the first 16 bits for your message length; then you need the first 16 pixels to save the length of your message. Save the binary value of your message length, for example bin(114), in the last bit of each of the first 16 pixels. You may need to pad it with leading zeros to fill all 16 bits.
From pixel 17, start saving your message bit by bit in the last bit of every pixel.
To retrieve the embedded message, read the first 16 pixels, construct the length to know how many pixels you need to read to reconstruct your message.
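Since the post is tagged Java, the steps above could be sketched roughly like this. It is a minimal illustration under a few assumptions of my own: the class name LsbStego is made up, only the single lowest bit of each packed RGB value is used (which is the lowest bit of the blue channel), pixels are visited row by row, the 16-bit length counts bytes of the UTF-8 message, and bits are written most significant first.

```java
import java.awt.image.BufferedImage;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: hides a short text in the lowest bit of each pixel.
final class LsbStego {

    private static final int LENGTH_BITS = 16;

    // Writes the 16-bit byte length, then the message bits, one bit per pixel.
    static void embed(BufferedImage img, String message) {
        byte[] data = message.getBytes(StandardCharsets.UTF_8);
        long capacity = (long) img.getWidth() * img.getHeight();
        if (data.length > 0xFFFF || LENGTH_BITS + data.length * 8L > capacity) {
            throw new IllegalArgumentException("message does not fit into the image");
        }
        for (int i = 0; i < LENGTH_BITS; i++) {
            setBit(img, i, (data.length >> (LENGTH_BITS - 1 - i)) & 1);
        }
        for (int i = 0; i < data.length * 8; i++) {
            int bit = (data[i / 8] >> (7 - i % 8)) & 1;
            setBit(img, LENGTH_BITS + i, bit);
        }
    }

    // Reads the length from the first 16 pixels, then that many message bytes.
    static String extract(BufferedImage img) {
        int length = 0;
        for (int i = 0; i < LENGTH_BITS; i++) {
            length = (length << 1) | getBit(img, i);
        }
        byte[] data = new byte[length];
        for (int i = 0; i < length * 8; i++) {
            data[i / 8] = (byte) ((data[i / 8] << 1) | getBit(img, LENGTH_BITS + i));
        }
        return new String(data, StandardCharsets.UTF_8);
    }

    // Masks the lowest bit of the pixel to zero, then stores the given bit there.
    private static void setBit(BufferedImage img, int index, int bit) {
        int x = index % img.getWidth();
        int y = index / img.getWidth();
        img.setRGB(x, y, (img.getRGB(x, y) & ~1) | bit);
    }

    private static int getBit(BufferedImage img, int index) {
        return img.getRGB(index % img.getWidth(), index / img.getWidth()) & 1;
    }
}
```

The image has to be saved in a lossless format such as PNG afterwards; JPEG compression would change the low bits and destroy the message.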
You can even save another image or voice signal inside your image. I know this post is tagged Java, but to give you an idea you can have a look at my python code implementing the same thing in this_github_link, in file "stagenography_embed_text_in_image.py". Also I have explained the algorithm to save other types of information in this stackoverflow_post.
Hope it makes it clear.

Alternative to latex / a way to typeset good looking documents from Java to PDF

I'm working on an application in Java that will maintain a database of song lyrics in plain text and print out songbooks/chordbooks (that is, create a PDF file from selected songs). I was planning for the Java application to generate source code for pdflatex; after compiling this source, the user would get a PDF file.
Lately I've run into a lot of problems because of LaTeX limitations: fixed memory size (some pictures will also be drawn to the PDF) with an error when it is exceeded, no way to query the end of line or end of page dynamically, it's very hard to override the LaTeX placement algorithm in a complex way, ... see also some of my other questions regarding LaTeX. I've come to the conclusion that LaTeX is not a good option for automated PDF generation.
So I need replacement. I need to be able to typeset:
Chords over lyrics when the lyrics are in variable char width so I need to be able to measure text width
Chord diagrams that means I'll have to draw quite complex pictures
Each song on separate double page
Different fonts etc.
Thanks for all answers

Here are some PDF open source APIs
http://java-source.net/open-source/pdf-libraries
This has been asked many times; you might want to look at this post.

IText is a free library which offers lots of capabilities for creating PDFs programmatically.

Rather than try to manage/calculate the complexities of the desired layout, you could try Docmosis. It will let you layout a document as a template using doc or odt formats. This means if you could make a doc or odt look like you want, you can turn it into a template and get Docmosis to render it as a PDF. Text and images can be placed inside or outside tables which makes layout fairly easy to manage.

ConTeXt is another TeX system, but it is easier to control the layout than with LaTeX. For drawing you could use PGF/TikZ or MetaPost. Support for both is available in ConTeXt. With ConTeXt's built in Lua scripting you could draw the chords automatically, assuming you have them stored in some sort of data structure.

Why not just use LilyPond with LaTeX? It's meant for typesetting music.
