I have a servlet with an API that delivers images from GET requests. The servlet creates a data file of CAD commands based on the parameters of the GET request. This data file is then delivered to an image parser, which creates an image on the file system. The servlet reads the image and returns the bytes on the response.
All of the I/O and the call to the image parser program can be very taxing: images of around 80 KB take 3000-4000 ms to render on a local system.
There are roughly 20 parameters that make up the GET request. Each correlates to a different portion of the image. So the number of possible images is extremely large.
To alleviate the loading time, I plan to store BLOBs of rendered images in a database. If a GET request matches one previously executed, I will pull from cache. Else, I will render a new one. This does not fix "first-time" run, but will help "n+1 runs".
Any other ideas on how I can improve performance?
You can store the file on disk and the image path in the database, because database storage is usually more expensive than file system storage.
Sort the HTTP GET parameters and hash them to form an index key for the image record, so that you can look it up quickly by parameters.
To make sure your program does not crash when disk capacity runs low, you should remove unused or rarely used records:
Store a lastAccessedTime for each record and update it each time the image is requested.
Use a scheduler to check lastAccessedTime and remove records whose weight falls below a specified threshold.
You can use different strategies to calculate the weight, based on lastAccessedTime, accessedCount, image size, etc.
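As a rough sketch of the cleanup step, the selection of stale records can be a small pure function that a scheduler calls periodically. The class name, the in-memory map and the idle threshold are assumptions made for illustration; a real setup would read lastAccessedTime from the database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CacheEviction {

    // Returns the keys whose last access is older than maxIdleMillis.
    // lastAccessed maps a cache key to its lastAccessedTime in epoch millis.
    static List<String> selectStale(Map<String, Long> lastAccessed,
                                    long now, long maxIdleMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastAccessed.entrySet()) {
            if (now - e.getValue() > maxIdleMillis) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

A ScheduledExecutorService with scheduleAtFixedRate could call selectStale at a fixed interval and delete the returned files and records. A weight that also includes accessedCount or image size would replace the simple age comparison.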
You can turn all the parameters that you feed into the rendering pipeline into a single String in a predictable way, so that you can compute a SHA-1 hash of the input and store the output file in a directory with the hash as the file name. When you get a request with the same parameters, you compute the hash and check whether the file is on disk. If it is, return it; otherwise send the work to the render pipeline and create the file.
If you have a lot of files you might want to use more than one directory, maybe look at how git divides up files across directories by the first few chars of the SHA1 for inspiration.
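A minimal sketch of that idea, assuming the parameters arrive as a Map of names to values. The class name, the k=v&k=v canonical form and the .png extension are illustrative choices, not something the question specifies.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

class RenderCacheKey {

    // Canonical form: parameters sorted by name, joined as k=v&k=v.
    // Assumes names and values do not themselves contain '&' or '='
    // (URL-encode them first if they can).
    static String canonicalize(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(params).entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Hex-encoded SHA-1 of the input (40 characters).
    static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException ex) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(ex);
        }
    }

    // git-style layout: the first two hex characters name the subdirectory.
    static Path cachePath(Path root, Map<String, String> params) {
        String hash = sha1Hex(canonicalize(params));
        return root.resolve(hash.substring(0, 2)).resolve(hash.substring(2) + ".png");
    }
}
```

The servlet would call cachePath for each request, serve the file if Files.exists reports it, and otherwise render and write to that path. Sorting through a TreeMap means the order of parameters in the query string does not change the key.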
I use a similar setup in my app. I am not doing rendering, just storing files. The files are stored in the DB, but for performance reasons I serve them from disk, using the SHA-1 hash of the file contents as the filename / URI for the file.
Related
I'm having trouble figuring out how to store and scan large numbers of visited URLs from a web crawler. The idea is that the number of visited URLs will eventually be too large to keep in memory, so I should store them in a file, but doesn't this become very inefficient? If, after getting a batch of URLs, I want to check whether a URL has already been visited, do I have to check the visited file line by line to see if there is a match?
I had thought about using a cache but the problem still remains when the URL is not found in the cache and I would still have to check the file. Do I have to check the file line-by-line for every URL and is there a better/more efficient way to do this?
A key data structure here could be a Bloom filter, and Guava provides an implementation. The Bloom filter tells you either that you may have visited the URL or that you definitely have not. If the result is "maybe", you check the file to see whether the URL was really visited; otherwise you visit the URL and store it in the file as well as in the Bloom filter.
Now, to optimise the file seeks, you can hash the URL to get a fixed-size byte[] (for example, with MD5) rather than a variable-length string.
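byte[] hash = md5(url); // md5(), checkTheFile(), visitUrl(), addToFile() are placeholder helpers
// Guava's BloomFilter.mightContain() can give false positives, never false negatives
if (bloomFilter.mightContain(hash) && checkTheFile(hash)) {
    // already visited; nothing to do
} else {
    visitUrl(url);
    addToFile(hash);
    bloomFilter.put(hash);
}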
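To show why a "no" from the filter is definitive while a "yes" is only a maybe, here is a small hand-rolled Bloom filter. It is a simplified illustration, not Guava's implementation; the class name, the bit-array size and the double-hashing scheme are assumptions chosen to keep the sketch short.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // 32-bit FNV-1a hash, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int index(String value, int i) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, size);
    }

    // Sets every bit position derived from the value.
    void put(String value) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(value, i));
        }
    }

    // false: the value was definitely never added.
    // true: the value was possibly added (other values may have set the same bits).
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(value, i))) return false;
        }
        return true;
    }
}
```

Because bits are only ever set, a value that was added can never be reported as absent. The false-positive rate depends on the bit-array size, the number of hash functions and the number of entries, which is why a "maybe" still needs the file or database check.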
You can use a database with the hash as the primary key, so that checking whether a key exists is an indexed lookup (close to constant time with a hash index, logarithmic with a B-tree), or you can implement an index yourself.
What about having a file per URL? If the file exists then the URL has been crawled.
You can then get more sophisticated and have this file contain data that indicates the results of the last crawl, how long you want to wait before the next crawl, etc. (It's handy to be able to spot 404s, for example, and decide whether to try again or abandon that URL.)
With this approach, it's worth normalising the URLs so that minor differences in the URL don't result in different files.
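One possible normalisation, sketched with java.net.URI. It assumes absolute http or https URLs with a host, and the specific rules (lower-casing scheme and host, dropping default ports and fragments, resolving dot segments) are illustrative choices rather than what node-nutch does.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class UrlNormalizer {

    // Returns a canonical string form so that trivially different
    // spellings of the same URL map to the same file name or key.
    static String normalize(String url) throws URISyntaxException {
        URI u = new URI(url).normalize(); // resolves "." and ".." path segments
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);

        int port = u.getPort();
        boolean defaultPort = (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);

        String path = u.getRawPath();
        if (path == null || path.isEmpty()) path = "/";

        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) sb.append(':').append(port);
        sb.append(path);
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        // The fragment is dropped on purpose: it never reaches the server.
        return sb.toString();
    }
}
```

The normalised string can then be hashed to produce a safe file name, since raw URLs contain characters that are not valid in file names.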
I've used this technique in my node crawler (https://www.npmjs.com/package/node-nutch) and it allows me to store the status either in the file system (for simple crawls) or on S3 (for crawls involving more than one server).
I am developing a demo application in Hadoop and my input is .mrc image files. I want to load them to hadoop and do some image processing over them.
These are binary files that contain a large header with metadata followed by the data of a set of images. The information on how to read the images is also contained in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first [number_of_pixels_x*number_of_pixels_y*bytes_per_pixel] bytes are the first image, then comes the second, and so on.
What is a good input format for these kinds of files? I thought of two possible solutions:
Convert them to sequence files by placing the metadata in the sequence file header and having key/value pairs for each image. In this case, can I access the metadata from all mappers?
Write a custom InputFormat and RecordReader and create splits for each image while placing the metadata in the distributed cache.
I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?
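The layout described above implies a simple offset calculation, which is what a custom RecordReader or a conversion tool would need in order to locate each image. This is a sketch only: the class name is made up and the header size is passed in as a parameter rather than assumed.

```java
class ImageStackLayout {

    // Size in bytes of a single image in the stack.
    static long imageBytes(int pixelsX, int pixelsY, int bytesPerPixel) {
        return (long) pixelsX * pixelsY * bytesPerPixel;
    }

    // Byte offset of image `index` (0-based) from the start of the file.
    static long imageOffset(long headerBytes, int index,
                            int pixelsX, int pixelsY, int bytesPerPixel) {
        return headerBytes + index * imageBytes(pixelsX, pixelsY, bytesPerPixel);
    }
}
```

The arithmetic is done in long because the product of the dimensions and the image index can exceed the int range for large stacks.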
Without knowing your file formats, the first option seems to be the better option. Using sequence files you can leverage a lot of SequenceFile related tools to get better performance. However, there are two things that do concern me with this approach.
How will you get your .mrc files into a .seq format?
You mentioned that the header is large; this may reduce the performance of SequenceFiles.
But even with those concerns, I think that representing your data in SequenceFiles is the best option.
I have a project where I need to generate a PDF file. Within this PDF I have to insert a body of text as well as four or five large images (roughly 800px*1000px). In order to make this flexible I have opted to use FreeMarker in conjunction with XHTMLRenderer (flying-saucer).
I am now faced with a couple of options:
Create the images and save them as temporary files to disk. Then process an .xhtml template with FreeMarker (saving it to disk) and pass the processed .xhtml file URL to XHTMLRenderer to generate the PDF. All these created files (bar the PDF) would be created with File.createTempFile. This would allow FreeMarker to pick the images up off the disk (as if they were images linked in the XHTML)
Process the .xhtml template and keep it in memory. Pass the images to the template as base64 encoded data urls. This would remove the need for saving any temporary files as the output from FreeMarker could be passed directly to XHTMLRenderer.
Base64 Encoded Image Url example (a small folder icon):
<img src="
/ge8WSLf/rhf/3kdbW1mxsbP//mf///yH5BAAAAAAALAAAAAAQAA4AAARe8L1Ekyky67QZ1hLnjM5UUde0ECwLJoExK
cppV0aCcGCmTIHEIUEqjgaORCMxIC6e0CcguWw6aFjsVMkkIr7g77ZKPJjPZqIyd7sJAgVGoEGv2xsBxqNgYPj/gAwXEQA7" />
My main question is which would be a better technique? Is creating lots of temporary files bad (does it carry lots of overhead)? Could I potentially run out of memory creating such large base64 encoded strings?
I found myself asking the same question recently. After some benchmarking, it turns out the data URI approach was the best bet.
Storing a bunch of Base64-encoded images can be expensive. But the overhead of creating temp files, streaming image data in, then waiting for XHTMLRenderer to hit each temp file 4 times before cleaning it up is also taxing.
In my experiments, the Base64 images proved to be a better approach. That being said, I'm not sure to what extent it will remain true for larger images. In my case, I was testing with 32x32 icons, 80x80 logos, 400x240 bar graphs and one 600x400 graphic. The difference in overhead was significant with everything except the 600x400 graphic, where it got really negligible.
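For reference, building such a data URI in Java takes only a few lines with java.util.Base64. This is a sketch: the class name is made up, and passing the resulting string into the FreeMarker data model is left out.

```java
import java.util.Base64;

class DataUri {

    // Builds a data URI that can be used as the src of an <img> element.
    static String toDataUri(String mimeType, byte[] imageBytes) {
        String encoded = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + encoded;
    }
}
```

Base64 turns every 3 bytes into 4 characters, so the encoded string is about a third larger than the image file, and it is held in memory alongside the rest of the processed template.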
(A side note for Joop Eggen: in my case, PDF generation is time critical. The user clicks a button for the PDF and expects the download to begin immediately.)
PDF generation is not time critical; one might even consider throttling the communication. Embedding images in Base64 costs a bit more CPU and memory in an already costly templating transformation: the Base64 bulk data is dragged through the templating pipeline, then probably decoded from Base64 to binary to be compressed. I was not even aware that embedded images are possible. So temp files, despite their overhead, are the safer solution, certainly to start with. Of course one can benchmark both cases.
I have a very large (around a gigapixel) image I am trying to generate, and so far I can only create images up to around 40 megapixels in a BufferedImage before I get an out of memory error. I want to construct the image piece by piece, then combine the pieces without loading the images into memory. I could also do this by writing each piece to a file, but ImageIO does not support this.
I think JAI can help you build what you want. I would suggest looking at the data structures and streams offered by JAI.
Also, have a look at these questions, might help you with ideas.
How to save a large fractal image with the least possible memory footprint
How to create a big image file from many tiles
Appending to an Image File
You basically want to reverse 2 there.
Good luck with your project ;)
Not a proper solution, just a sketch.
Unpacking a piece of image is not easy when an image is compressed. You can decompress, by an external tool, the image into some trivial format (xpm, uncompressed tiff). Then you could load pieces of this image as byte arrays, because the format is so straightforward, and create Image instances out of these raw data.
I see two easy solutions. Create a custom binary format for your image. For saving, just generate one part at a time, seek() to the appropriate spot in the file, then offload your data. For loading, seek() to the appropriate spot in the file, then load your data.
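A rough sketch of the saving side, assuming a headerless, row-major raw file and tiles that are generated one at a time. The class and parameter names are made up; a real format would also need a small header recording width, height and bytes per pixel.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class RawTileWriter {

    // Writes one tile into a raw row-major image file.
    // tileX/tileY are the pixel coordinates of the tile's top-left corner;
    // tilePixels holds the tile's rows back to back.
    static void writeTile(RandomAccessFile file, int imageWidth, int bytesPerPixel,
                          int tileX, int tileY, int tileWidth, int tileHeight,
                          byte[] tilePixels) throws IOException {
        int tileRowBytes = tileWidth * bytesPerPixel;
        for (int row = 0; row < tileHeight; row++) {
            // Offset of this tile row inside the full image, computed in long
            // so that files larger than 2 GB are addressed correctly.
            long offset = ((long) (tileY + row) * imageWidth + tileX) * bytesPerPixel;
            file.seek(offset);
            file.write(tilePixels, row * tileRowBytes, tileRowBytes);
        }
    }
}
```

Only one tile needs to be in memory at a time, so the size of the final image is bounded by disk space rather than heap. Loading a region works the same way in reverse, with readFully in place of write.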
The other solution is to learn an image format yourself. bmp is uncompressed, but the only easy one to learn. Once learned, the above steps work quite well.
Remember to convert your image to a byte array for easy storage.
If there is no way to do it built into Java (for your sake I hope this is not the case and that someone answers saying so), then you will need to implement an algorithm yourself, just as others have commented here saying so.
You do not necessarily need to understand the entire algorithm yourself. If you take a pre-existing algorithm, you could just modify it to load the file as a byte stream, create a byte buffer to keep reading chunks of the file, and modify the algorithm to accept this data a chunk at a time.
Some formats, such as JPEG, might not be possible to write from a linear stream of file chunks in this manner. As @warren suggested, BMP is probably the easiest to implement in this way, since that file format just has a header of so many bytes and then dumps the raw pixel data straight out in binary format (along with some padding). So you would load up the sub-images that need to be combined, logically one at a time (though you could actually multithread this and load the next data concurrently to speed it up, as this process is going to take a long time), read the next line of data, save that out to your binary output stream, and so on.
You might even need to load the sub-images multiple times. For example, imagine an image being saved which is made up of 4 sub-images in a 2x2 grid. You might need to load image 1, read its first line of data, save that to your new file, release image 1, load image 2, read its first line of data, save, release 2, load 1 to read its 2nd line of data, and so on. You would be more likely to need to do this if you use a compressed image format for saving in.
To suggest a bmp again, since bmp is not compressed and you can just save the data in whatever format you want (assuming the file was opened in a manner which provides random access), you could skip around in the file you're saving so that you can completely read 1 sub-image and save all of its data before moving on to the next one. That might provide run time savings, but it might also provide terrible saved file sizes.
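Skipping around in a BMP file comes down to two small calculations, sketched here under the assumption of an uncompressed BMP with the common 54-byte header (14-byte file header plus 40-byte BITMAPINFOHEADER), no palette, and a positive height field, which means rows are stored bottom-up. The class name is made up.

```java
class BmpLayout {

    // 14-byte file header + 40-byte BITMAPINFOHEADER, no colour table.
    static final int HEADER_BYTES = 54;

    // Each pixel row is padded to a multiple of 4 bytes.
    static long rowStride(int width, int bitsPerPixel) {
        return (((long) width * bitsPerPixel + 31) / 32) * 4;
    }

    // File offset of row y, where y = 0 is the top row of the picture.
    // With a positive height in the header, the bottom row is stored first.
    static long rowOffset(int width, int height, int bitsPerPixel, int y) {
        return HEADER_BYTES + (long) (height - 1 - y) * rowStride(width, bitsPerPixel);
    }
}
```

With these offsets, each sub-image can be written completely with seek() before moving on to the next one. The size fields in this header are 32 bits wide, so a gigapixel image may approach or exceed what the classic format can describe.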
And I could go on. There are likely to be multiple pitfalls, optimizations, and so on.
Instead of saving 1 huge file which is the result of combining other files, what if you created a new image file format which was merely made up of meta-data allowing it to reference other files in a way which combined them logically without actually creating 1 massive file? Whether or not creating a new image file format is an option depends on your software; if you are expecting people to take these images to use in other software, then this would not work - at least, not unless you could get your new image file format to catch on and become standard.
I need some way to store a configuration/status file that needs to be changed rapidly. The status is stored in that file as key-value pairs. It needs to be updated very rapidly, following the status of a communication (digital multimedia broadcasting) hardware device.
What is the best way to go about creating such a file? ini? XML? Any off the shelf filewriter in Java? I can't use databases.
It sounds like you need random access to update parts of the file frequently without rewriting the entire file. Design a binary file format and use the RandomAccessFile API to read and write it. You will want to use a fixed number of bytes for each key and each value, so that you can index into the middle of the file and update a value without having to rewrite all of the following records. Basically, you would be re-implementing how a database stores a table.
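A minimal sketch of such a fixed-width layout. The 16-byte key and value widths, the space padding, the ASCII encoding and the class name are all assumptions picked for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FixedRecordFile {
    static final int KEY_BYTES = 16;
    static final int VALUE_BYTES = 16;
    static final int RECORD_BYTES = KEY_BYTES + VALUE_BYTES;

    // Pads (or truncates) text to exactly `length` bytes, filling with spaces.
    static byte[] pad(String text, int length) {
        byte[] out = new byte[length];
        Arrays.fill(out, (byte) ' ');
        byte[] src = text.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, length));
        return out;
    }

    // Writes a whole record (key and value) into the given slot.
    static void writeRecord(RandomAccessFile f, int slot, String key, String value)
            throws IOException {
        f.seek((long) slot * RECORD_BYTES);
        f.write(pad(key, KEY_BYTES));
        f.write(pad(value, VALUE_BYTES));
    }

    // Overwrites only the value bytes of one slot; no other record is touched.
    static void updateValue(RandomAccessFile f, int slot, String value)
            throws IOException {
        f.seek((long) slot * RECORD_BYTES + KEY_BYTES);
        f.write(pad(value, VALUE_BYTES));
    }

    // Reads the value of one slot and strips the padding.
    static String readValue(RandomAccessFile f, int slot) throws IOException {
        byte[] buf = new byte[VALUE_BYTES];
        f.seek((long) slot * RECORD_BYTES + KEY_BYTES);
        f.readFully(buf);
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }
}
```

The slot number for each key would be kept in an in-memory map built once at startup. An update then costs one seek and one 16-byte write, regardless of how many records the file holds.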
Another alternative is to only store a single key-value pair per file such that the cost of re-writing the file is minor. Maybe you can think of a way to use file name as the key and only store value in the file content.
I'd be inclined to try the second option unless you are dealing with more than a few thousand records.
The obvious solution would be to put the "configuration" information into a Properties object, and then use Properties.store(...) or Properties.storeToXML(...) to save to a file output stream or writer.
You also need to do something to ensure that whatever is reading the file will see a consistent snapshot. For instance, you could write to a new file each time and do a delete / rename dance to replace the old with the new.
But if the update rate for the file is too high, you are going to create a lot of disk traffic, and you are bound to slow down your application. This is going to apply (eventually) no matter what file format / API you use. So, you may want to consider not writing to a file at all.
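One way to sketch the write-then-rename step uses Properties.store together with Files.move. The class name and temp-file prefix are made up, and whether ATOMIC_MOVE replaces an existing target is implementation specific according to the Files.move documentation, so this should be verified on the target platform.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

class AtomicPropertiesWriter {

    // Writes the properties to a temp file in the same directory, then
    // renames it over the target so readers never see a half-written file.
    static void save(Properties props, Path target) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "status", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, null);
        }
        // Same directory means same file system, which an atomic rename requires.
        // Replacement of an existing target is implementation specific.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A reader that opens the target path sees either the complete old file or the complete new one, never a partial write.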
At some point, configuration that changes too rapidly becomes "program state" and not configuration. If it is changing so rapidly, why do you have confidence that you can meaningfully write it to, and then read it from, a filesystem?
Say more about what the status is and who the consumer of the data is...