Storing lots of small files: archive vs. filesystem


I am creating an application that requires a lot of image thumbnails (~3000, 5-25KB each). Because speed is essential I plan on loading these images into memory when the application starts. At runtime, new thumbnails will be downloaded and added to the collection.
I could store them all in a folder, but reading thousands of files into memory when a program starts hardly seems efficient.
My second option would be to save them in some kind of (compressed) archive. This would make storage itself and loading more efficient (I think). However, new files will be added regularly, and that will probably not go as smoothly as just saving them in a folder.
Is storing a cache of small files in a (compressed) archive a bad idea or not? Are ZIP files the way to go? Would I be better off using uncompressed archives (and if so, what kind)?
All image files will be JPEGs.
Thanks in advance!
EDIT: I am considering dropping the "load everything into memory on application start" idea. This would simplify my question a little. My initial idea to put everything in one big file now seems less beneficial, since the problem of many files in one directory can be solved by hashing into subdirectories.

Small files don't compress especially well, so you may not gain much compression.
While loading the files will be fast because they are smaller, decompression adds time. You'd have to experiment to see which is faster.
I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.
I would consider doing something like writing them out into one file, uncompressed, that could be streamed into memory -- maybe not necessarily contiguous memory, as that might be a problem. But the idea would be to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset from which the location of the image in memory could be determined.
New images could be added at the end, and the index updated appropriately.
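As a rough illustration of the indexed-file idea, here is a minimal sketch. The class name ThumbnailPack and its methods are made up for this example, the index lives only in memory, and writing the index to disk so it survives a restart is left out.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical append-only pack file: image bytes stored back to back, found through an index. */
final class ThumbnailPack implements AutoCloseable {
    /** Index entry: where an image starts in the file and how many bytes it has. */
    private static final class Entry {
        final long offset;
        final int length;

        Entry(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final RandomAccessFile file;
    // Assumption: the index is kept in memory only; persisting it is not shown.
    private final Map<String, Entry> index = new HashMap<>();

    ThumbnailPack(Path path) throws IOException {
        this.file = new RandomAccessFile(path.toFile(), "rw");
    }

    /** Appends the image at the end of the file and records its position. */
    synchronized void put(String name, byte[] data) throws IOException {
        long offset = file.length();
        file.seek(offset);
        file.write(data);
        index.put(name, new Entry(offset, data.length));
    }

    /** Reads one image back, or returns null if the name is not in the index. */
    synchronized byte[] get(String name) throws IOException {
        Entry entry = index.get(name);
        if (entry == null) {
            return null;
        }
        byte[] data = new byte[entry.length];
        file.seek(entry.offset);
        file.readFully(data);
        return data;
    }

    @Override
    public synchronized void close() throws IOException {
        file.close();
    }
}
```

Putting an image under a name that already exists just appends a new copy and repoints the index; the old bytes stay in the file until you rewrite it.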
It isn't fancy but that's what you're trying to avoid. An archive or even a file system gives you lots of power and flexibility but at the cost of efficiency. When you know what you want to do, sometimes simple is better.
I would consider implementing a solution that reads files from a folder, another that divides the files into subfolders and subsubfolders so there are no more than 100 or so files in any given folder, then time those solutions so you have something to compare to. I would think a simple indexed file would be fast enough that you wouldn't even need to pre-load the images like you're suggesting -- just retrieve them as you need them and keep them around once they're in memory.
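For the subfolder variant, one possible way to spread files out is to derive the folder names from a hash of the file name. This is only a sketch: ShardedPaths, shardedPath and the levels parameter are names invented here, and SHA-256 is just one convenient choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that maps a file name to a hashed subfolder path. */
final class ShardedPaths {
    private ShardedPaths() {}

    /**
     * Returns root/xx/.../name, where each xx is one byte of the SHA-256 hash
     * of the name written as two hex digits. Each level has at most 256 folders.
     */
    static Path shardedPath(Path root, String name, int levels) {
        if (levels < 0 || levels > 32) {
            throw new IllegalArgumentException("levels must be between 0 and 32");
        }
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
        Path dir = root;
        for (int i = 0; i < levels; i++) {
            dir = dir.resolve(String.format("%02x", digest[i] & 0xff));
        }
        return dir.resolve(name);
    }
}
```

With about 3000 thumbnails, one level (256 folders) already averages roughly 12 files per folder, so a second level is probably only worth it for much larger collections.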

All disk-based storage, and most databases, allocate space in chunks. The chunks on large-capacity disks can be large. If you have 5 KB files and a 32 KB disk chunk, you end up with about 84% wasted space on your storage.
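The arithmetic behind that estimate can be checked with a few lines. This is a sketch that assumes every file occupies a whole number of chunks and ignores filesystem metadata; SlackSpace and its method names are made up for illustration.

```java
/** Hypothetical helper for estimating space lost to chunked allocation. */
final class SlackSpace {
    private SlackSpace() {}

    /** Bytes occupied on disk when space is handed out in whole chunks. */
    static long allocatedBytes(long fileSize, long chunkSize) {
        long chunks = (fileSize + chunkSize - 1) / chunkSize; // round up to whole chunks
        return chunks * chunkSize;
    }

    /** Fraction of the allocated space that holds no data. */
    static double wastedFraction(long fileSize, long chunkSize) {
        long allocated = allocatedBytes(fileSize, chunkSize);
        if (allocated == 0) {
            return 0.0; // an empty file occupies no data chunks in this model
        }
        return (double) (allocated - fileSize) / allocated;
    }
}
```

A 5 KB file in a 32 KB chunk wastes 27/32 of the chunk, or roughly 84%; a 25 KB file wastes 7/32, or roughly 22%.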
Using an archive won't compress JPEGs much because the JPEG encoding already does that. It will, however, save you the wasted space on the storage media. It does make things more complicated and perhaps a little slower.

In my opinion, the ZIP file approach is a bad idea, because loading the ZIP file and unzipping it to extract each image will slow everything down.
The point of a thumbnail is that it is small, so your app and hardware can load it as fast as possible. So I believe it is a better idea to load each image as you need it.
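Loading on demand could look something like the sketch below. LazyThumbnailCache is a made-up name, the cache holds raw file bytes rather than decoded images, and nothing is ever evicted, which is probably acceptable for a few thousand thumbnails of 5-25 KB.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cache that reads a thumbnail from disk the first time it is requested. */
final class LazyThumbnailCache {
    private final Path folder;
    private final Map<String, byte[]> loaded = new ConcurrentHashMap<>();

    LazyThumbnailCache(Path folder) {
        this.folder = folder;
    }

    /** Returns the file's bytes, hitting the disk only on the first request for each name. */
    byte[] get(String name) {
        return loaded.computeIfAbsent(name, key -> {
            try {
                return Files.readAllBytes(folder.resolve(key));
            } catch (IOException e) {
                // Assumption: a missing or unreadable file is treated as a programming error.
                throw new UncheckedIOException(e);
            }
        });
    }
}
```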

Well, if you have small, "geometric" pictures, you may implement them as objects of type javax.swing.Icon rather than images to load from the filesystem.
http://download.oracle.com/javase/6/docs/api/javax/swing/Icon.html
http://download.oracle.com/javase/tutorial/uiswing/components/icon.html
So you will implement one or more objects which draw themselves onto a Graphics surface using the Graphics drawing primitives, instead of copying pixels.

If this is a web application, then the best performance boost you can get is setting good HTTP caching headers. Having a unique URL for every image (and different URLs for different versions of the same image) makes it possible to set VERY far-future Expires headers, because changing the image changes the URL, which triggers a refetch.
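One common way to get a different URL for each version of an image is to put a hash of the file content into its name. The sketch below assumes that scheme; VersionedUrls, versionedName and the 16-hex-digit length are choices made up for this example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that builds file names which change whenever the content changes. */
final class VersionedUrls {
    private VersionedUrls() {}

    /** Returns e.g. "logo-<16 hex digits>.jpg", where the digits come from a SHA-256 of the content. */
    static String versionedName(String baseName, String extension, byte[] content) {
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-256").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 8; i++) { // the first 8 bytes are enough to tell versions apart
            hex.append(String.format("%02x", digest[i] & 0xff));
        }
        return baseName + "-" + hex + "." + extension;
    }
}
```

Because the name only changes when the bytes change, the response for such a URL can be cached for as long as you like.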
I wouldn't compress, because JPEG does not compress well and it only costs CPU time.
I would recommend simply storing the images in the filesystem and considering libraries like jawr, or implementing your own caching strategy.

I know this question has already been answered, but I think you need more options than zipping.
While ZIP is good, it doesn't do much for JPEG, since JPEG is already compressed.
Other things you may want to consider:
Put the images on a Content Delivery Network (CDN).
Compress components with gzip (meaning the server automatically compresses every response). You don't need to write any code to unzip it later; the browser handles that automatically.
Since you mention JPEG, you may want to use jpegtran. Run jpegtran on all your JPEGs.
This tool does lossless JPEG operations such as rotation and can also be used to optimize and remove comments and other useless information (such as EXIF information) from your images.
jpegtran -copy none -optimize -perfect src.jpg dest.jpg
Use image sprites. Instead of asking the browser to download many images at the same time, ask it to download only one.
For details, read: http://developer.yahoo.com/performance/rules.html#opt_images
For a basic examination of how to improve your website's performance, you can try installing YSlow (a plugin that detects inefficient code) in Firefox.
Hope that helps.

Related

Looking for an efficient file caching system

I'm currently developing an MMO which uses numerous sprites (image files), and I plan to store these files in a compressed state on the user's hard drive. I was wondering if there already exists an implementation of an efficient, directory-based cache system which I can use to store these image files in different folders that can compress into either one file or multiple files. I was also researching LZ4 (de)compression, and I suppose that would be useful as well, but that does not solve the directory issue.
Thanks!
EDIT: For example, one file should hold numerous image files.
If something like this does not exist, what would the fastest way be to compress multiple image files into one file, and then decompress to load them into memory when the program starts?

Speed optimization of Website with a lot of images

I am currently working on a website which involves a lot of images. The problem is that all the images are uploaded by users, so I can't do anything to alter the images. The website runs quite OK on my local system, but on the server it becomes far too slow.

I'd suggest using TimThumb. It generates thumbnails on the fly from a URL and uses very little disk space.

If the users of your website are uploading the images, then I presume there must be an upload script. Inside that script, or directly after it runs, you could compress or rescale the image to the size needed on the website, shortening loading time. There is a PHP image processing library called ImageMagick here:
http://php.net/manual/en/book.imagick.php
There is the PHP GD image processing library here:
http://php.net/manual/en/book.image.php
I don't have much personal experience with them, but from my knowledge it looks like one will do the job. Off the top of my head, that's the best solution I can think of, and hopefully it works. There is not a lot you can change about your problem if you don't compress/scale the images, and these are probably your best options. Wish you the best.

How can I compress a batch of images?

We are constantly transferring gigabytes of compressed Tiffs overseas and it takes a long time for each batch of images to transfer. It is not uncommon for a batch to take over 6 hours to transfer. I would like to reduce the time to transfer a batch of images.
I understand that videos compress really well because most of the time each frame is generally very similar to the one before it and compression algorithms take advantage of that. In our scenario, the images often look similar to one another. Are there any image compression libraries I can use to take advantage of the fact that there is a lot of redundancy across images? Ideally I would want lossless compression.
Would it work if I turned the images into a video before transferring them and then turned them back to images on the other side? If this would work, what libraries would you recommend? I need to be able to call this from Java and preferably run it on Linux, but the library does not need to be written in Java. Windows could also be a possibility.

What I would try first:
Start from uncompressed tiffs (otherwise, it will be hard to find similarities).
tar them together (so they are contained within a single file; this can be a specific range of images, of course).
Then use a compression algorithm of your choice to see which one yields the best results (on the single file).
Easy enough to try out without much effort. How well it works depends on the source images themselves (and the compression algorithm used).
Alternative approach if the above does not yield enough results:
1. Make sure you have all uncompressed images.
2. Send over the first image.
3. Do a binary diff (or maybe diff the hexdumps) against the next image.
4. Send over the diff file and apply it at the receiving end to reconstruct the image.
5. Repeat 3-4 for every image.
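To get a feel for whether diffing helps on your images, a very crude stand-in for a real binary diff is a byte-wise XOR of two consecutive images followed by ordinary compression. This sketch assumes both images are uncompressed and exactly the same length in bytes; XorDelta and its methods are names made up here, and a proper diff tool will handle shifted or resized content far better.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

/** Hypothetical helper: XOR delta between two equally sized byte arrays, plus DEFLATE compression. */
final class XorDelta {
    private XorDelta() {}

    /**
     * XORs two arrays of equal length. Identical bytes become zero, so similar images
     * give a delta that is mostly zeros. Applying xor(previous, delta) restores current.
     */
    static byte[] xor(byte[] previous, byte[] current) {
        if (previous.length != current.length) {
            throw new IllegalArgumentException("both inputs must have the same length");
        }
        byte[] out = new byte[current.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (previous[i] ^ current[i]);
        }
        return out;
    }

    /** Compresses data with DEFLATE at the highest compression level. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}
```

Comparing the size of deflate(xor(previous, current)) with deflate(current) for a handful of real image pairs should show quickly whether the redundancy across images is worth exploiting.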
I personally don't think you will easily get good (lossless) results by using video compression algorithms (after all, they are specifically tailored to a different purpose).

Adding Many Saved Images in Android App

The application I'm trying to build will have a lot of images displayed (in ImageViews), and I'm not fetching them from a server/online service as it will need to be used offline. I know I can just dump them in the res/drawable directories, but I was wondering if there's any way to optimize this. Is there a way to somehow compress these images (besides making them smaller, they're already as small as I need) or use some other sort of android tool to better store them locally on the device?
I could just be overlooking a well used feature, and if so, it'd be great if someone could point me to that.
Edit: If I were to compress the images somehow, I would need to decompress at runtime or something, and that would take another thread/loading time. I'm not sure how to do that either, so I'm just brainstorming various ways, and I thought someone here would've come across this at some point.

If you haven't already, this is a good read - http://developer.android.com/guide/practices/ui_guidelines/icon_design.html#design-tips
When saving image assets, remove unnecessary metadata
Although the Android SDK tools will automatically compress PNGs when
packaging application resources into the application binary, a good
practice is to remove unnecessary headers and metadata from your PNG
assets. Tools such as OptiPNG or Pngcrush can ensure that this
metadata is removed and that your image asset file sizes are
optimized.
Outside of all other compression logic the above would be the place to start. Also when you say "optimize" - do you mean optimize the way images/drawables are loaded in your app or just the amount of space (on disk) the app will consume?

How to get pixel rgb value in hadoop?

I have millions of images stored in HDFS (Hadoop). I want to build an index of these images. How do I get the pixel RGB values of these images? I am new to Hadoop, and the image format in Hadoop is different from the original image binary format. Another problem is should I use the sequencefile in hadoop to pack the enormous images to a big file for efficiency? Many thanks.

I could answer the problem partially.
Another problem is should I use the sequencefile in hadoop to pack the enormous images to a big file for efficiency?
It depends on the size of the individual files. If the individual files are really big, then consolidating them might not help much; if they are small, it probably will.
Check this query on SO for more details.

If you have the additional storage and efficiency is important to you, I would definitely go with a SequenceFile. Hadoop will handle splitting the file up for you. We ran into a case where we were extracting data from imagery files, similar to what you are doing. In our case we were extracting metadata for ingestion into a discovery system so that our imagery files could be searched outside of the cluster. In this case, because efficiency was not a big deal for us, we just processed the files individually, making sure they were not splittable. This way the other system can reach back over HTTP to grab the source files.
