Looking for an efficient file caching system

Looking for an efficient file caching system - java

I'm currently developing an MMO which utilizes numerous sprites (image files), and I plan to store these files in a compressed state on the user's hard drive. I was wondering if there already exists an implementation of an efficient, directory-based cache system, in which I can utilize to store these image files in different folders that can compress into either one file or multiple files. I was also researching LZ4 (de)compression, and I suppose that would be useful as well, but that does not solve the directory issue.
Thanks!
EDIT: For example, one file should hold numerous image files.
If something like this does not exist, what would the fastest way be to compress multiple image files into one file, and then decompress to load them into memory when the program starts?

Related

Does saving files in a ".zip" folder speed up file write time to network drive?

I know that when I write a new file to a folder that ends in ".zip" it compresses the file. This is when using BufferedOutputStream in JAVA and saving to a windows file system. I'm saving these files to a network drive, so the write time is dependent on network speed.
Will saving to a .zip folder speed up write time? In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file? Sorry if this is an ignorant question.

There are so many misconceptions in the Question, I think it is worth going through them one at a time.
I know that when I write a new file to a folder that ends in ".zip" it compresses the file.
That is not correct. Creating a file with a ".zip" suffix does not automatically make it compressed. Writing files to a directory that has ".zip" as its filename suffix (?!?) doesn't either. Not in Java. Not in other languages.
In order to get compression, the application needs to take steps to make this happen. In Java you could use ZipOutputStream to write a file in ZIP file format. However, a ZIP file is actually an "archive" format that is designed to hold multiple files in a ZIP file. If you simply trying to compress a single file, there are better alternatives; e.g. GZIPOutputStream.
(It is also possible that this so-called "ZIP folder" you are talking about is a normal ZIP file that has been "mounted" as a loopback file system. You / someone else would have had to set that up explicitly. Anyhow, if this is what is going on here, it is nothing to do with Java. It is all happening in external software and in the operating system where the ZIP is "mounted".)
This is when using BufferedOutputStream in JAVA and saving to a windows file system.
Erm ... no. See above. However you are correct that it may be better to use a BufferedOutputStream to write files, though it only really helps if your application is writing the files in small chunks; e.g. a byte at a time. (Stream compression complicates the issue, so it is difficult to give a simple, general answer on this.)
I'm saving these files to a network drive, so the write time is dependent on network speed.
Correct. It is also dependent on network latency, the protocols used and the load on the remote file server. (If you have a ZIP "mounted", then that is going to add overheads too.)
Will saving to a .zip folder speed up write time?
Maybe. See above. It depends what you mean by a ZIP folder.
Ignoring that, writing the files (the right way) in compressed and / or archive form from Java may speed up writes. There are actually two things to consider:
For plain compression, you are trading off the time it takes the application (!!) to compress and decompress the data against the time (and disk space) you are saving by moving and storing less bytes.
For ZIP files (and similar archive formats) there is a second potential saving. Storing and retrieving lots of individual small files from a file system is slow compared with storing and retrieving a single ZIP file containing those files.
And if you are looking for optimal compression, then ZIP is not the best option.
In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file?
There are so many variables that it is hard to say for sure. But unless you have done something odd, it is likely that the bytes are sent over the network in compressed form.
Finally, I would advise you NOT to try to combine mounted ZIP files and network shares:
The combination of the two could potentially interact in ways that makes performance worse.
There is a risk that you will end up with a corrupted ZIP or lost files if the network share goes offline at an inconvenient point.

PDF Compression - HTML to PDF (wkhtmltopdf)

Background
I'm working on a Scala/Java project where we convert individual HTML files to PDF files. We then merge the individual files into one larger complete PDF file.
For the converting we are using sPDF which is built on top of wkhtmltopdf. For the merging we use PDFMergerUtility.
The reasons for making individual files is a bit complicated - but it should be noted we can't make one big PDF off the bat and have to make the individual files first.
The issue
Initially we had no problems with this approach - however as the system has grown - so have the final PDF files. We went from files that were 2MB-3MB to files that are 20MB. I would like to know if there is any obvious compression methods or techniques we could use?
There is a lot of shared content across the individual files but since we're just merging them as isolated/independent files (as in, none of the content that is the same across the individual files is being reused to save space) it doesn't make a difference in bringing down the file size.
If I manually ZIP the final PDF file it greatly reduces the file size -as obviously there is a lot of repeated content.
So one option might just be to zip the PDF after I've finished the merging, but I would prefer to compress it during the merger process or conversion process.
Any ideas?

You could try Sejda to merge, it's Java, open source and based on a fork of PDFBox. It can generate PDF files using object streams (PDFBox currently doesn't support that) and, in case it doesn't reduce the size that much, you can try to pipe its 'compress' task which goes through the document removing unused resources and compressing images.
It's battle tested as engine behind PDFsam so, if you want to give it a quick test and see what's the outcome, just download PDFsam, use the merge module with your files (and compression flag on) and the result is what Sejda will generate.

Practical Use to Temp Files

What would be a practical use for temporary files (see code below)?
File temp = File.createTempFile("temp-file-name", ".tmp");
Why can't you store the data you would keep in the file in some variables? If the file is (probably) going to be deleted on the program exit (as "temp" implies), why even create them?
An example can be such as when downloading a file, it often appears as a temporary file while the downloading completes.

The two reasons I know of:
As storage space for large chunks of memory you don't need at the moment, when doing memory-intensive tasks like video editing
A kind of hacky way of interproccess communication

Aside from the ram versus disk comment above. You may use temp files as precusor files or files about to be processed or served. For example, a server may generate a large PDF for a browser. That PDF file would be stored as a temp file while the (possibly slow) browser downloads the file. Once the communication is complete, the temp file can be destroyed.

For our little 'imagefilesystem' project (http://code.google.com/p/imagefilesystem/) we actually use the /tmp directory to store the thumbnails we created based upon the images in the local filesystem. So the thumbs were created 'on demand' and were, as the name of /tmp says it itself' temporary of nature so that it didn't create GBs of permanent data.

How to get pixel rgb value in hadoop?

I have millions of images stored in hdfs of hadoop. I want to build a index of these images. How to get pixel rgb values of these images? I am new in hadoop, the image format in hadoop is different from the original image binary format. Another problem is should I use the sequencefile in hadoop to pack the enormous images to a big file for efficiency? Many thanks.

I could answer the problem partially.
Another problem is should I use the sequencefile in hadoop to pack the enormous images to a big file for efficiency?
Depends on the size of the individual files. If the individual files are really big, then consolidating them might not really help and the other way also.
Check this query on SO for more details.

If you have the additional storage and efficiency is important to you I would definitely go with a SequenceFile. Hadoop will handle splitting the file up for you. We ran into a case where we were extracting data from imagery file similar to what you are doing. In our case we were extracting metadata for ingestion in a discovery system so that our imagery files could be searched outside of the cluster. In this case because efficiency was not a big deal for us we just process the files individually making sure to make them not splittable. This way the other system can reach back over http to grab the source files.

Storing lots of small files: archive vs. filesystem

I am creating an application that requires a lot of image thumbnails (~3000, 5-25KB). Because speed is essential I plan on loading these images into memory when the application starts. At runtime, new thumbnails will be downloaded and added to the collective.
I could store them all in a folder, but reading thousands of files into memory when a program starts hardly seems efficient.
My second option would be to save them in some kind of (compressed) archive. This would make storage itself and loading more efficient (I think). However, new files will be added regularly, and that will probably not go as smoothly as just saving them in a folder.
Is storing a cache of small files in a (compressed) archive a bad idea or not? Are ZIP files the way to go? Would I be better off using uncompressed archives (and if so, what kind)?
All image files will be JPEG's.
Thanks in advance!
EDIT: I am considering to drop the "load everything into memory on application start" thing. This would simplify my question a little. My initial idea to put everything in one big file now seems less beneficial, since the problem of many files in one directory can be solved by hashing into subdirectories.

Small files don't compress especially well, so you may not gain much compression.
While loading the files will be fast because they are smaller, decompression adds time. You'd have to experiment to see which is faster.
I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.
I would consider doing something like writing them out into one file, uncompressed, that could be streamed into memory -- maybe not necessarily contiguous memory, as that might be a problem. But the idea would be to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset from which the location of the image in memory could be determined.
New images could be added at the end, and the index updated appropriately.
It isn't fancy but that's what you're trying to avoid. An archive or even a file system gives you lots of power and flexibility but at the cost of efficiency. When you know what you want to do, sometimes simple is better.
I would consider implementing a solution that reads files from a folder, another that divides the files into subfolders and subsubfolders so there are no more than 100 or so files in any given folder, then time those solutions so you have something to compare to. I would think a simple indexed file would be fast enough that you wouldn't even need to pre-load the images like you're suggesting -- just retrieve them as you need them and keep them around once they're in memory.

All disk based storage, and most database, allocate space in chunks. The chunks on large capacity disks can be large. If you have 5kb files and a 32kb disk chunk you end up with 85% wasted space on your storage.
Using an archive won't compress jpeg much because the jpeg encoding algorithm already does that. It will however save you the wasted space on the storage media. It does make things more complicated and perhaps a little slower.

In my opinion I think that the zip file way it´s a bad idea, because you will slowdown everything with the process to load the zip file and unzip it to extract each image.
I think that the purpose of a thumbnail image is that by nature is small so your app plus hardware can load it as fast as possible. So I believe that it is a better idea to load each image as you need it.

Well, if you have small, "geometric" pictures, you may implement them as objects of type javax.swing.Icon rather than images to load from the filesystem.
http://download.oracle.com/javase/6/docs/api/javax/swing/Icon.html
http://download.oracle.com/javase/tutorial/uiswing/components/icon.html
So you will implement one or more objects which draw themselves onto a Graphics surface using the Graphics drawing primitives, instead of copying pixels.

If this is a web-application then the best performance boost you can get is setting good HTTP caching headers. Having a unique URL for every image (also different URLs for different versions of the same image) makes it possible to set VERY far future expire headers, because changing the image changes the URL leading into refetch.
I won't compress, because JPEG cannot be good compressed and it only costs CPU time.
I would recommend to simply store the images into filesystem and consider the use of libraries like jawr or implement your own caching strategy.

I know this question has already answered but I think you need more options other than zipping.
While zip is good, It's not really affect much for JPEG since JPEG has already compressed.
Other thing you may want to consider is :
Put the image in Content Delivery Network (CDN)
Compress components with gzip ( mean the server will automatically zip every response ) and you dont need to write any code to unzip it later - it's handled by the browser automatically.
Since you mention JPEG, you may want to use JPEGTran.Run jpegtran on all your JPEGs.
This tool does lossless JPEG operations such as rotation and can also be used to optimize and remove comments and other useless information (such as EXIF information) from your images.
jpegtran -copy none -optimize -perfect src.jpg dest.jpg
Use Image Sprites. Instead of asking browser to download many image at same time, ask the browser to only download one.
For the details read : http://developer.yahoo.com/performance/rules.html#opt_images
For the basic examination how to improve your website performance you can try install YSlow ( plugin to detect uneffecient code ) in Firefox.
Hope that helps.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.