Custom Binary Input - Hadoop - java

I am developing a demo application in Hadoop, and my input is .mrc image files. I want to load them into Hadoop and do some image processing on them.
These are binary files that contain a large header with metadata, followed by the data for a set of images. The information on how to read the images is also contained in the header (e.g. number_of_images, number_of_pixels_x, number_of_pixels_y, bytes_per_pixel), so after the header bytes, the first [number_of_pixels_x*number_of_pixels_y*bytes_per_pixel] bytes are the first image, then comes the second, and so on.
What is a good input format for these kinds of files? I thought of two possible solutions:
Convert them to sequence files by placing the metadata in the sequence file header and having a pair for each image. In this case, can I access the metadata from all mappers?
Write a custom InputFormat and RecordReader and create splits for each image while placing the metadata in the distributed cache.
I am new to Hadoop, so I may be missing something. Which approach do you think is better? Is there any other way that I am missing?

Without knowing your file format, the first option seems to be the better one. Using sequence files, you can leverage a lot of SequenceFile-related tools to get better performance. However, there are two things that concern me with this approach:
How will you get your .mrc files into a .seq format?
You mentioned that the header is large; this may reduce the performance of SequenceFiles.
But even with those concerns, I think that representing your data in SequenceFiles is the best option.
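Whichever option is chosen, the boundaries of each image follow directly from the header fields listed in the question. Here is a minimal sketch of that arithmetic in plain Java. It assumes the header length is already known and passed in (the question does not give a value), the class and method names are hypothetical, and no Hadoop API is used.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: locates and reads one image in a file laid out as
// [header][image 0][image 1]..., as described in the question.
final class ImageStackLayout {
    final long headerBytes;   // assumed to be known; not specified in the question
    final int numberOfImages;
    final int pixelsX;
    final int pixelsY;
    final int bytesPerPixel;

    ImageStackLayout(long headerBytes, int numberOfImages,
                     int pixelsX, int pixelsY, int bytesPerPixel) {
        this.headerBytes = headerBytes;
        this.numberOfImages = numberOfImages;
        this.pixelsX = pixelsX;
        this.pixelsY = pixelsY;
        this.bytesPerPixel = bytesPerPixel;
    }

    // Size of one image in bytes.
    long imageBytes() {
        return (long) pixelsX * pixelsY * bytesPerPixel;
    }

    // Byte offset of image `index` from the start of the file.
    long offsetOf(int index) {
        if (index < 0 || index >= numberOfImages) {
            throw new IndexOutOfBoundsException("image " + index);
        }
        return headerBytes + index * imageBytes();
    }

    // Reads exactly one image without touching the others.
    byte[] readImage(RandomAccessFile file, int index) throws IOException {
        byte[] data = new byte[Math.toIntExact(imageBytes())];
        file.seek(offsetOf(index));
        file.readFully(data);
        return data;
    }
}
```

A converter to sequence files could use these offsets to cut the file into one value per image, and a custom RecordReader could use them to emit one record per image.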

Related

How costly (in time) are read and write operations on a CSV file in Java?

I am writing software, part of which deals with read and write operations. I am wondering how costly these operations are on a CSV file. Are there any other file formats that take less time? I ask because I have to write to and read from CSV files at the end of every cycle.
Read and write operations depend on the file system, hardware, software configuration, memory, memory setup and the size of the file to read, but not on the format. A different but related problem is the cost of parsing the file, which should be relatively low because CSV is very simple.
The point is that CSV is a good format for tables of data but not for nested data. If your data has a lot of nested information, you can separate it into different CSV files, or you will have some information redundancy that will penalize your performance. But other formats might have other kinds of redundancy.
And do not optimize prematurely. If you are reading and writing the file very frequently, it will most likely be kept in RAM. JSON or a zipped file might save size and be read faster, but would have a higher parsing time and could end up even slower. The parsing time also depends on the implementation of the library (Gson vs Jackson) and its version.
It would be nice to know the reasons behind your problem in order to give better answers.
The cost of reading from and writing to a CSV file, and whether the format is suitable for your application, depend on the details of your use case. Specifically, if you are simply reading from the beginning of the file and writing to the end of the file, then the CSV format is likely to work fine. However, if you need to access particular records in the middle of your file, then you will probably want to choose another format.
The main issue with a CSV file is that it is not a good format choice for random access: each record (row) is of variable size, so you cannot simply seek to a particular record offset in the file, and instead need to read every row (well, you could still jump and sample, but you cannot seek directly by record offset). Other formats with fixed-size records would allow you to seek directly to a particular record in the file, making it possible to update an entry in the middle of the file without re-reading and re-writing the entire file.
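To illustrate the fixed-size alternative, here is a minimal sketch using java.io.RandomAccessFile. The record width of 32 bytes, the zero padding, and the class and method names are assumptions chosen for illustration; they do not come from any particular file format.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical fixed-size record store: every record occupies RECORD_BYTES,
// so record i starts at byte i * RECORD_BYTES and can be reached with seek().
final class FixedRecordFile {
    static final int RECORD_BYTES = 32; // assumed width, chosen for illustration

    // Writes (or overwrites) the record at `index` without touching the others.
    static void write(RandomAccessFile file, long index, String value) throws IOException {
        byte[] text = value.getBytes(StandardCharsets.UTF_8);
        if (text.length > RECORD_BYTES) {
            throw new IllegalArgumentException("record too long");
        }
        byte[] record = Arrays.copyOf(text, RECORD_BYTES); // zero-padded to the fixed width
        file.seek(index * RECORD_BYTES);
        file.write(record);
    }

    // Reads the record at `index` directly, skipping everything before it.
    static String read(RandomAccessFile file, long index) throws IOException {
        byte[] record = new byte[RECORD_BYTES];
        file.seek(index * RECORD_BYTES);
        file.readFully(record);
        int length = 0;
        while (length < RECORD_BYTES && record[length] != 0) {
            length++; // stop at the zero padding
        }
        return new String(record, 0, length, StandardCharsets.UTF_8);
    }
}
```

The trade-off is wasted space for short records and a hard limit on record length, which is why this only pays off when random access or in-place updates are actually needed.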

Best way to compare two very large XML files record by record

I have two large XML files (3GB, 80000 records). One is an updated version of the other. I want to identify which records changed (were added/updated/deleted). There are some timestamps in the files, but I am not sure they can be trusted. The same goes for the order of records within the files.
The files are too large to load into memory as XML (even one, never mind both).
The way I was thinking about it is to do some sort of parsing/indexing of content offsets within the first file at record level, with an in-memory map of IDs, then stream the second file and use random access to compare those records that exist in both. This would probably take 2 or 3 passes, but that's fine. But I cannot find an easy library or approach that would let me do it. vtd-xml with VTDNavHuge looks interesting, but I cannot tell from the documentation whether it supports random-access revisiting and loading of records based on pre-saved locations.
A Java library or solution is preferred, but C# is acceptable too.
Just parse both documents simultaneously using SAX or StAX until you encounter a difference, then exit. It doesn't keep the document in memory. Any standard XML library will support S(t)AX. The only problem would be if you consider different order of elements to be insignificant...
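If record order cannot be trusted, a lockstep comparison will report reorderings as differences. One possible variant, which replaces the simultaneous walk rather than implementing it, is to stream each file once with StAX and keep only a digest per record ID; 80000 digests fit comfortably in memory. This is a minimal sketch, assuming each record is a <record id="..."> element. Those element and attribute names are hypothetical, as are the class and method names.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

final class RecordDigests {
    // Streams one document and returns id -> digest of that record's content.
    // Assumption: records are <record id="..."> elements (hypothetical names).
    static Map<String, String> index(InputStream xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        Map<String, String> digests = new HashMap<>();
        String currentId = null; // non-null while inside a record
        int depth = 0;           // nesting depth below the record element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (currentId == null) {
                    if ("record".equals(reader.getLocalName())) {
                        currentId = reader.getAttributeValue(null, "id");
                        depth = 0;
                        sha.reset();
                    }
                } else {
                    depth++;
                }
                if (currentId != null) hashStart(sha, reader);
            } else if (event == XMLStreamConstants.CHARACTERS && currentId != null) {
                // Surrounding whitespace is ignored so re-indentation is not a change.
                add(sha, reader.getText().trim());
            } else if (event == XMLStreamConstants.END_ELEMENT && currentId != null) {
                if (depth == 0) {
                    digests.put(currentId, Base64.getEncoder().encodeToString(sha.digest()));
                    currentId = null;
                } else {
                    depth--;
                    add(sha, "</>");
                }
            }
        }
        reader.close();
        return digests;
    }

    // Feeds an element name and its attributes (in document order) to the digest.
    private static void hashStart(MessageDigest sha, XMLStreamReader reader) {
        add(sha, "<" + reader.getLocalName());
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            add(sha, " " + reader.getAttributeLocalName(i) + "=" + reader.getAttributeValue(i));
        }
    }

    private static void add(MessageDigest sha, String text) {
        sha.update(text.getBytes(StandardCharsets.UTF_8));
    }

    // Returns id -> "added", "updated" or "deleted"; unchanged ids are omitted.
    static Map<String, String> diff(Map<String, String> oldDigests, Map<String, String> newDigests) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : newDigests.entrySet()) {
            String before = oldDigests.get(e.getKey());
            if (before == null) changes.put(e.getKey(), "added");
            else if (!before.equals(e.getValue())) changes.put(e.getKey(), "updated");
        }
        for (String id : oldDigests.keySet()) {
            if (!newDigests.containsKey(id)) changes.put(id, "deleted");
        }
        return changes;
    }
}
```

Calling index once per file on a FileInputStream and passing both maps to diff takes two streaming passes in total. Note that this sketch treats reordered attributes or child elements inside a record as an update.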

How to Combine Images without loading them into RAM in Java

I have a very large (around a gigapixel) image I am trying to generate, and so far I can only create images up to around 40 megapixels in a BufferedImage before I get an out of memory error. I want to construct the image piece by piece, then combine the pieces without loading the images into memory. I could also do this by writing each piece to a file, but ImageIO does not support this.
I think JAI can help you build what you want. I would suggest looking at the data structures and streams offered by JAI.
Also, have a look at these questions; they might give you some ideas.
How to save a large fractal image with the least possible memory footprint
How to create a big image file from many tiles
Appending to an Image File
You basically want to do the reverse of the second one.
Good luck with your project ;)
Not a proper solution, just a sketch.
Unpacking a piece of image is not easy when an image is compressed. You can decompress, by an external tool, the image into some trivial format (xpm, uncompressed tiff). Then you could load pieces of this image as byte arrays, because the format is so straightforward, and create Image instances out of these raw data.
I see two easy solutions. Create a custom binary format for your image. For saving, just generate one part at a time, seek() to the appropriate spot in the file, then offload your data. For loading, seek() to the appropriate spot in the file, then load your data.
The other solution is to learn an image format yourself. BMP is uncompressed, but it is the only easy one to learn. Once learned, the above steps work quite well.
Remember to convert your image to a byte array for easy storage.
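The seek-and-write idea for a custom binary format could look roughly like this. It is a minimal sketch, assuming a headerless row-major layout with 4 bytes per pixel; the layout, the class name and the method names are all hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical raw canvas: pixels stored row by row, BYTES_PER_PIXEL each,
// with no header. Each tile is written straight to its final position, so
// only one tile has to be in memory at a time.
final class RawCanvas {
    static final int BYTES_PER_PIXEL = 4; // assumed: one byte per ARGB channel

    // Byte offset of pixel (x, y) in a canvas that is canvasWidth pixels wide.
    static long offset(long canvasWidth, long x, long y) {
        return (y * canvasWidth + x) * BYTES_PER_PIXEL;
    }

    // Writes a tile whose pixels are packed row by row in tilePixels.
    // (tileX, tileY) is the tile's top-left corner on the canvas.
    static void writeTile(RandomAccessFile canvas, long canvasWidth,
                          long tileX, long tileY, int tileWidth, int tileHeight,
                          byte[] tilePixels) throws IOException {
        int rowBytes = tileWidth * BYTES_PER_PIXEL;
        for (int row = 0; row < tileHeight; row++) {
            canvas.seek(offset(canvasWidth, tileX, tileY + row));
            canvas.write(tilePixels, row * rowBytes, rowBytes);
        }
    }
}
```

Tiles can be written in any order. A reader needs the canvas width and height from somewhere else, for example a small header or a sidecar file, since this layout does not store them.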
If there is no way to do it built into Java (for your sake I hope this is not the case and that someone answers saying so), then you will need to implement an algorithm yourself, as others here have suggested.
You do not necessarily need to understand the entire algorithm yourself. If you take a pre-existing algorithm, you could just modify it to load the file as a byte stream, create a byte buffer to keep reading chunks of the file, and modify the algorithm to accept this data a chunk at a time.
Some formats, such as JPEG, might not be possible to write from a linear stream of file chunks in this manner. As @warren suggested, BMP is probably the easiest to implement this way, since the file format just has a header of so many bytes and then dumps the pixel data straight out in binary (along with some padding). So you would load the sub-images that need to be combined logically one at a time (though you could actually multithread this and load the next data concurrently to speed it up, as this process is going to take a long time), read the next line of data, save that out to your binary output stream, and so on.
You might even need to load the sub-images multiple times. For example, imagine an image being saved which is made up of 4 sub-images in a 2x2 grid. You might need to load image 1, read its first line of data, save that to your new file, release image 1, load image 2, read its first line of data, save, release 2, load 1 to read its 2nd line of data, and so on. You would be more likely to need to do this if you use a compressed image format for saving in.
To suggest a bmp again, since bmp is not compressed and you can just save the data in whatever format you want (assuming the file was opened in a manner which provides random access), you could skip around in the file you're saving so that you can completely read 1 sub-image and save all of its data before moving on to the next one. That might provide run time savings, but it might also provide terrible saved file sizes.
And I could go on. There are likely to be multiple pitfalls, optimizations, and so on.
Instead of saving 1 huge file which is the result of combining other files, what if you created a new image file format which was merely made up of meta-data allowing it to reference other files in a way which combined them logically without actually creating 1 massive file? Whether or not creating a new image file format is an option depends on your software; if you are expecting people to take these images to use in other software, then this would not work - at least, not unless you could get your new image file format to catch on and become standard.

best practices question: How to save a collection of images and a java object in a single file? File is read to be rendered

I am making a Java program that has a collection of flash-card-like objects. I store the objects in a JTree composed of DefaultMutableTreeNodes. Each node has a user object attached to it, which has a few String/native data type parameters. However, I also want each of these objects to have an image (typical formats: jpg, png, etc.).
I would like to be able to store all of this information, including the images and the tree data to the disk in a single file so the file can be transferred between users and the entire tree, including the images and parameters for each object, can be reconstructed.
I had not approached a problem like this before, so I was not sure what the best practices were. I found XMLEncoder (http://java.sun.com/j2se/1.4.2/docs/api/java/beans/XMLEncoder.html) to be a very effective way of storing my tree and the native data type information. However, I couldn't figure out how to save the image data itself inside the XML file, and I'm not sure it is possible since the data is binary (so restricted characters would be invalid). My next thought was to associate a hash string instead of an image with each user object, and then gzip together all of the images, with the hash strings as the names, and the XMLEncoder-encoded tree in the same compressed file. That seemed really contrived, though.
Does anyone know a good approach for this type of issue?
Thanks!
Assuming this isn't just a serializable graph, consider bundling the files together in Jar format. If you already have your data structures working with XMLEncoder, you can reuse this code by saving the data as a jar entry.
If memory serves, the jar library has better support for Unicode name entries than the zip package, which is why I would favour it.
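A minimal sketch of the jar approach with java.util.jar follows. The entry names (such as "tree.xml" for the XMLEncoder output and "images/<key>.png" for pictures) and the class and method names are hypothetical. It works on byte arrays to stay self-contained, and readAllBytes assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

// Hypothetical bundle format: one jar holding the encoded tree plus every image.
final class CardBundle {
    // Packs entry name -> content into a single jar, returned as bytes.
    static byte[] pack(Map<String, byte[]> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                jar.putNextEntry(new JarEntry(e.getKey()));
                jar.write(e.getValue());
                jar.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Reads every entry of the jar back into memory, keyed by entry name.
    static Map<String, byte[]> unpack(byte[] bundle) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(bundle))) {
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null) {
                entries.put(entry.getName(), jar.readAllBytes()); // reads to the end of this entry
            }
        }
        return entries;
    }
}
```

To write a real file, the ByteArrayOutputStream can be swapped for a FileOutputStream. The user objects would then store only the entry name of their image, which is close to the hash-string idea from the question.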
You might consider using an MS JET database (.mdb file) and storing all the stuff in there. That'll also make it easy to examine and edit the data in (for example) MS Access.
You can employ a virtual file system, which stores its data in a single container. We develop and offer one such file system, SolFS; however, right now there is no Java binding for it. We will release a Java JNI interface for SolFS within a month.

Palm Database (PDB) files in Java?

Has anybody written any classes for reading and writing Palm Database (PDB) files in Java? (I mean on a server, not on the Palm device itself.) I tried to google, but all I got were Protein Data Bank references.
I wrote a Perl program that does it using Palm::PDB.pm, but I want to turn it into a servlet for a GWT app.
The jSyncManager project at http://www.jsyncmanager.org/ is under the LGPL and includes classes to read and write PDB files -- look in jSyncManager/API/Protocol/Util/DLPDatabase.java in its source code. It looks like the core code you need from this could be isolated from the rest of the library with a little effort.
There are a few ways that you can go about this:
Easiest but slowest: Find a Perl-to-Java bridge. This will not be quick, but it will work and it should involve the least amount of work.
Find a C++/C# implementation that you have the source to and convert it (this should be the fastest solution).
Find a Java reader... there seem to be a few listed on Google; however, I do not have any experience with these.
Depending on what your intended usage is, you might look into writing a simple reader yourself. The format is pretty simple and you only need to handle a couple of simple fields to parse it.
Basically, there is a header for the entire file, which has a 2-byte integer at the end that specifies the number of records. So just skip your way through the bytes for all the other fields in the header and then read the last field, which is the number of records in the file. Be aware that the PDB format writes integers with the most significant byte first.
Following this, there will be a record header for each record, the first field of which is the actual offset into the file for the record itself. Again, be aware of the byte order.
So, now you have the offsets into the file for each record in the file, which should make it very easy to read the actual records as long as you know the format of these for the type of PDB file you are trying to read.
Wikipedia has a nice overview of the header formats.
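The header walk described above can be sketched as follows. The concrete sizes are assumptions taken from commonly published descriptions of the format, not from this answer: 76 bytes of header fields before the 2-byte record count, and 8 bytes per record entry (a 4-byte offset followed by 4 bytes of attributes and unique ID). Check them against the header overview before relying on them. The class and method names are hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical minimal PDB reader: returns the file offset of every record.
final class PdbOffsets {
    static final int HEADER_BYTES_BEFORE_COUNT = 76; // assumed, see lead-in
    static final int ENTRY_BYTES_AFTER_OFFSET = 4;   // assumed: attributes + unique ID

    static long[] readRecordOffsets(InputStream in) throws IOException {
        // DataInputStream reads most significant byte first, matching PDB.
        DataInputStream data = new DataInputStream(in);
        data.readFully(new byte[HEADER_BYTES_BEFORE_COUNT]); // skip the other header fields
        int recordCount = data.readUnsignedShort();
        long[] offsets = new long[recordCount];
        for (int i = 0; i < recordCount; i++) {
            offsets[i] = data.readInt() & 0xFFFFFFFFL; // treat the offset as unsigned
            data.readFully(new byte[ENTRY_BYTES_AFTER_OFFSET]); // skip the rest of the entry
        }
        return offsets;
    }
}
```

With the offsets in hand, record i spans from offsets[i] up to offsets[i + 1], or to the end of the file for the last record. Decoding the bytes inside each record still depends on the type of PDB file.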
Maybe JPilot can help? They must have a lot of Java code dealing with Palm OS data.
