Best way for Hadoop to handle lots of image files - java

I've successfully managed to handle multiple image files in two ways in Hadoop:
Using Java to stitch the images together into a sequence file. This requires a text file that points to the locations of all the files.
Using Python and Hadoop Streaming to cache the files to each node with -cacheArchive, in the form of a tar.gz archive.
Both methods seem a bit ropey to me. Let's say I have one million files: I do not want to have to create the text file or zip up that many files. Is there a way I can just point my mapper at an HDFS folder and have it read through that folder at runtime? I know an input path can be used, but that is for text files. Or am I missing something? Any pointers are most appreciated.
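For what it's worth, the text-file step can be avoided by listing the HDFS directory at runtime and appending each image into a SequenceFile yourself. A minimal sketch (the class name is made up, and the old-style SequenceFile.createWriter overload is used; newer Hadoop versions prefer the Writer.Option variants):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ImagesToSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // args[0] = HDFS folder of images, args[1] = output sequence file
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, new Path(args[1]), Text.class, BytesWritable.class);
            try {
                // listStatus replaces the hand-maintained text file of paths
                for (FileStatus status : fs.listStatus(new Path(args[0]))) {
                    if (status.isDir()) continue;
                    // assumes each image fits comfortably in memory
                    byte[] bytes = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        in.readFully(0, bytes);
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(bytes));
                }
            } finally {
                writer.close();
            }
        }
    }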

Related

Get RandomAccessFile from JAR archive

Summary:
I have a program I want to ship as a single jar file.
It depends on three big resource files (700MB each) in a binary format. The file content can easily be accessed via indexing, so my parser reads these files as RandomAccessFile objects.
So my goal is to access resource files from a jar via File objects.
My problem:
When accessing the resource files from my file system, there is no issue, but I aim to pack them into the jar file of the program, so the user does not need to handle these files themselves.
The only way I have found so far to access a file packed in a jar is via an InputStream (obtained from class.getResourceAsStream()), which is useless for my application, as reading these files from start to end would be much too slow compared to using RandomAccessFile.
Copying the file content into a file, reading it, and deleting it at runtime is no option either, for the same reason.
Can someone confirm that there is no way to achieve my goal or provide a solution (or a hint so I can work it out myself)?
What I found so far:
I found this answer and if I understand the answer it says that there is no way to solve my problem:
Resources in a .jar file are not files in the sense that the OS can access them directly via normal file access APIs.
And since java.io.File represents exactly that kind of file (i.e. a thing that looks like a file to the OS), it can't be used to refer to anything in a .jar file.
A possible workaround is to extract the resource to a temporary file and refer to that with a File.
I think I can follow the reasoning behind it, but it is over eight years old now, and while I am not very educated when it comes to file systems and archives, I know that the Java language has evolved quite a lot since then, so maybe there is hope? :)
Probably useless background information:
The files are genomes in the 2bit format, and I use the TwoBitParser from biojava via the wrapper class TwoBitFacade. The Javadocs can be found here and here.
Resources are not files, and they live in a JAR file, which is not a random access medium.
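That said, if the temp-file workaround quoted in the question turns out to be acceptable after all, it is only a few lines. A sketch (class name and buffer size are arbitrary; the copy cost is paid once per run):

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.RandomAccessFile;

    public class ResourceToRandomAccess {
        // Copies a classpath resource to a temp file once, then opens
        // the copy for random access.
        public static RandomAccessFile open(String resource) throws Exception {
            InputStream in = ResourceToRandomAccess.class.getResourceAsStream(resource);
            File tmp = File.createTempFile("genome", ".2bit");
            tmp.deleteOnExit();
            FileOutputStream out = new FileOutputStream(tmp);
            byte[] buf = new byte[64 * 1024];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            out.close();
            in.close();
            return new RandomAccessFile(tmp, "r");
        }
    }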

Looking for an efficient file caching system

I'm currently developing an MMO that uses numerous sprites (image files), and I plan to store these files in a compressed state on the user's hard drive. I was wondering whether there already exists an efficient, directory-based cache system that I can use to store these image files in different folders, compressed into either one file or multiple files. I have also been researching LZ4 (de)compression, and I suppose that would be useful as well, but it does not solve the directory issue.
Thanks!
EDIT: For example, one file should hold numerous image files.
If something like this does not exist, what would the fastest way be to compress multiple image files into one file, and then decompress to load them into memory when the program starts?
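Absent a ready-made library, the JDK's own java.util.zip can pack a directory of sprites into one file and load them all back into memory at startup. A rough sketch with hypothetical names (ZipFile would also give you per-entry random lookup by name, if you'd rather not load everything up front):

    import java.io.*;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.zip.*;

    public class SpriteArchive {
        // Pack every regular file in a directory into one archive.
        public static void pack(File dir, File archive) throws IOException {
            ZipOutputStream zos = new ZipOutputStream(
                    new BufferedOutputStream(new FileOutputStream(archive)));
            byte[] buf = new byte[8192];
            for (File f : dir.listFiles()) {
                if (!f.isFile()) continue;
                zos.putNextEntry(new ZipEntry(f.getName()));
                FileInputStream in = new FileInputStream(f);
                for (int n; (n = in.read(buf)) != -1; ) zos.write(buf, 0, n);
                in.close();
                zos.closeEntry();
            }
            zos.close();
        }

        // Load every entry back into memory when the program starts.
        public static Map<String, byte[]> unpack(File archive) throws IOException {
            Map<String, byte[]> sprites = new HashMap<String, byte[]>();
            ZipInputStream zis = new ZipInputStream(
                    new BufferedInputStream(new FileInputStream(archive)));
            byte[] buf = new byte[8192];
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                for (int n; (n = zis.read(buf)) != -1; ) bos.write(buf, 0, n);
                sprites.put(e.getName(), bos.toByteArray());
            }
            zis.close();
            return sprites;
        }
    }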

mapreduce in java - gzip input files

I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple .gz files.
I've been looking all over, but all the tutorials I've found explain how to process a simple text file; I haven't found anything that solves my problem.
I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.
Any help would be appreciated.
Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you.
So all you have to do is write the logic as you would for a text file and pass in the directory which contains the .gz files as input.
One caveat with gzip files, though, is that they are not splittable: if you have gzip files of 5GB each, then each mapper will process a whole 5GB file instead of working with the default block size.
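In other words, the driver looks exactly like a plain-text job; nothing gzip-specific appears in it. A minimal pass-through sketch to illustrate (identity mapper, no reducer; the new Job(conf, name) constructor is the older API, newer versions use Job.getInstance):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class GzipPassThrough {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "gzip pass-through");
            job.setJarByClass(GzipPassThrough.class);
            // Identity mapper: each value arrives as an already-decompressed line.
            job.setMapperClass(Mapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            // args[0] is the directory of .gz files; Hadoop detects the
            // extension and decompresses transparently.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }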

TrueZip Random Access Functionality

I'm trying to understand how to randomly traverse a file (or files) in a .tar.gz using TrueZIP in a Java 6 environment (using the File classes). I found instances where it uses Java 7's Path; however, I can't come up with an example of how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections within the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large), then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, as tar does not include a central index/directory of files.
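As an illustration of that sequential header walk, here is a sketch using Apache Commons Compress rather than TrueZIP (assuming commons-compress is on the classpath): it streams through the gzip layer, but only reads each 512-byte tar header and prints the entry metadata, including the owner's username, without loading the entry bodies.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

    public class TarGzScan {
        public static void main(String[] args) throws IOException {
            TarArchiveInputStream tar = new TarArchiveInputStream(
                    new GZIPInputStream(new BufferedInputStream(
                            new FileInputStream(args[0]))));
            // Entry bodies are skipped as we advance from header to header.
            for (TarArchiveEntry e; (e = tar.getNextTarEntry()) != null; ) {
                System.out.println(e.getName() + "  " + e.getSize()
                        + "  owner=" + e.getUserName());
            }
            tar.close();
        }
    }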
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses the entire archive?
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
As previously, that should be fine -- I don't know TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comment of zran describes how such tools work:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion, one can say that the complete file has to be processed once to generate the necessary index.
That is still much faster than actually decompressing everything.
The index allows the file to be split into blocks that can be decompressed without having to decompress the blocks before them; that is how random access is emulated.

how to write into a text file in Java

I am doing a project in Java in which I need to add to and modify my text file at runtime; the file is packaged inside the jar.
I am using class.getResourceAsStream(filename); with this method I can read the file from the classpath.
I want to write into the same text file. What is a possible solution for this?
If I can't update the text file in the jar, what other solution is there?
Appreciate any help.
The easiest solution here is to not put the file in the jar. It sounds like you are putting files in your jar so that your user only needs to worry about one file that contains everything related to that program. This is an artificial constraint and just adds headaches.
There is a simple solution that still allows you to distribute just the jar file. At start-up, attempt to read the file from the file system. If you don't find it, use default values that are encoded in your program. Then, when changes are made, write them to the file system.
In general, you can't update a file that you located using getResourceAsStream. It might be a file in a JAR/ZIP file ... and writing it would entail rewriting the entire JAR file. It might be a remote file served up by a URL classloader.
For your sanity (and good practice), you should not attempt to update files that you access via the classpath. If you need to, read the file out of the JAR file (or whatever), copy it into the regular file system, and then update the copy.
I'm not saying that it is impossible to do this in all cases. Indeed, in most normal cases you can do it with some effort. However, this is not supported, and there are no standard APIs for doing this.
Furthermore, attempts to update resources are liable to cause anomalies in the classloader. For example, I'd expect resources in JAR files to not update (from the perspective of the application) until the application restarted. But resources in exploded JAR files probably would update ... though new resources might not show up.
Finally, there are cases where updating a resource is impossible:
When the user doesn't have write access to the application's installation directory. This is typical for a properly administered UNIX / Linux machine.
When the JAR file is fetched from a remote server, you are likely not to be able to write the updates back.
When you are using an arbitrary custom classloader, you've got no way of knowing where the actual bytes of an updated resource should be stored, and no way of storing them.
All JAR rewriting techniques in Java look similar. Open the Jar file, read all of its contents, and write a new Jar file containing the unmodified contents (and the modifications you wished to make). Such techniques are not advisable for a Jar file on the class path, much less a Jar file you're running from.
If you decide you must do it this way, Java World has a few articles:
Modifying Archives, Part 1
Modifying Archives, Part 2
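For completeness, the copy-and-rewrite loop those articles describe is short. A sketch with made-up file and entry names; note that JarInputStream consumes the manifest up front, so it has to be carried across explicitly:

    import java.io.*;
    import java.util.jar.*;

    public class JarRewrite {
        // Copy old.jar to new.jar, replacing the contents of one entry.
        public static void main(String[] args) throws IOException {
            JarInputStream in = new JarInputStream(new FileInputStream("old.jar"));
            Manifest mf = in.getManifest();  // null if old.jar has no manifest
            JarOutputStream out = mf != null
                    ? new JarOutputStream(new FileOutputStream("new.jar"), mf)
                    : new JarOutputStream(new FileOutputStream("new.jar"));
            byte[] buf = new byte[8192];
            for (JarEntry e; (e = in.getNextJarEntry()) != null; ) {
                out.putNextEntry(new JarEntry(e.getName()));
                if (e.getName().equals("config.txt")) {
                    out.write("new contents\n".getBytes("UTF-8")); // the modification
                    while (in.read(buf) != -1) { /* drain the old entry */ }
                } else {
                    for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
                }
                out.closeEntry();
            }
            out.close();
            in.close();
        }
    }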
A good solution that avoids the need to put your items into a Jar file is to read (if present) a properties file out of a hidden subdirectory in the user's home directory. The logic looks a bit like this:
    File hiddenDir = new File(System.getProperty("user.home"), ".myapp"); // name it after your application
    File fileInHiddenDir = new File(hiddenDir, "app.properties");
    if (!hiddenDir.exists()) {
        hiddenDir.mkdirs();
        writeTheDefaultPropertiesFile(fileInHiddenDir);
    }
    Properties appProps = new Properties();
    appProps.load(new FileInputStream(fileInHiddenDir));
    ...
    ... After the appProps have changed ...
    ...
    appProps.store(new FileOutputStream(fileInHiddenDir), "Do not modify this file");
Look to java.util.Properties, and keep in mind that they have two different load and store formats (key = value based and XML based). Pick the one that suits you best.
If I can't update the text file in the jar, what other solution is there?
Store the information in any of:
Cookies
The server
Deploy the applet using 1.6.0_10+, launch it using JWS, and use the PersistenceService to store the information. Here is my demo of the PersistenceService.
Also, if your users will agree to a trusted applet (which seems overkill for this), you might write the information to a sub-directory of user.home.
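For reference, the PersistenceService route mentioned above looks roughly like this (a sketch against the javax.jnlp API; the entry name and class name are made up):

    import java.io.OutputStream;
    import java.net.URL;
    import javax.jnlp.BasicService;
    import javax.jnlp.FileContents;
    import javax.jnlp.PersistenceService;
    import javax.jnlp.ServiceManager;

    public class MuffinStore {
        // Persist a small byte[] under the applet's codebase.
        public static void save(byte[] data) throws Exception {
            BasicService bs = (BasicService)
                    ServiceManager.lookup("javax.jnlp.BasicService");
            PersistenceService ps = (PersistenceService)
                    ServiceManager.lookup("javax.jnlp.PersistenceService");
            URL url = new URL(bs.getCodeBase(), "settings.txt");
            try {
                ps.create(url, data.length);  // allocate the entry if absent
            } catch (Exception alreadyExists) {
                // created on a previous run; fall through and overwrite
            }
            FileContents fc = ps.get(url);
            OutputStream out = fc.getOutputStream(true);  // true = overwrite
            out.write(data);
            out.close();
        }
    }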
