Save image file to HDFS using Spark - java

I have an image file
// sc is a JavaSparkContext; binaryFiles is an instance method returning a JavaPairRDD<String, PortableDataStream>
JavaPairRDD<String, PortableDataStream> image = sc.binaryFiles("/path/to/image.jpg");
I would like to process it and then save the binary data to HDFS using Spark, something like:
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible? Not saying 'as simple', just possible to do. If so, how would you do it? I am trying to keep a one-to-one mapping if possible, i.e. keeping the extension and type, so that if I download the file directly using the hdfs command line it would still be a viable image file.

Yes, it is possible, but you need a data serialization plugin, for example Avro (https://github.com/databricks/spark-avro).
Assume each image is represented as binary (byte[]) in your program, so the images can be a Dataset<byte[]>.
You can save it using
datasetOfImages.write()
    .format("com.databricks.spark.avro")
    .save("hdfs://cluster:port/path/to/images.avro");
images.avro would be a folder containing multiple partitions, and each partition would be an Avro file holding some of the images.
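For completeness, here is a rough sketch of how the binaryFiles call from the question could be turned into such a Dataset<byte[]> (the wildcard path is a placeholder; PortableDataStream.toArray() materialises each file as a byte array):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// binaryFiles yields (path, PortableDataStream) pairs; keep only the bytes
JavaRDD<byte[]> bytes = jsc.binaryFiles("/path/to/images/*")
    .map(pair -> pair._2().toArray());

Dataset<byte[]> datasetOfImages = spark.createDataset(bytes.rdd(), Encoders.BINARY());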
Edit:
It is also possible, though not recommended, to save the images as separate files. You can call foreachPartition on the dataset and use the HDFS API to save each image.
See below for a piece of code written in Scala; you should be able to translate it into Java.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

datasetOfImages.foreachPartition { images =>
  // Runs on the executors, so build the FileSystem from a fresh Configuration
  // (the SparkContext is not available inside the closure).
  val fs = FileSystem.get(new Configuration())
  images.foreach { image =>
    // "/path/to/this/image" is a placeholder; use a unique path per image.
    val out = fs.create(new Path("/path/to/this/image"))
    out.write(image)
    out.close()
  }
}
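For reference, a rough Java equivalent of the snippet above might look like this (a sketch only; the output path is a placeholder and should be made unique per image):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.ForeachPartitionFunction;

datasetOfImages.foreachPartition((ForeachPartitionFunction<byte[]>) images -> {
    // Runs on the executors; the default Configuration picks up the
    // cluster's Hadoop settings (core-site.xml / hdfs-site.xml).
    FileSystem fs = FileSystem.get(new Configuration());
    while (images.hasNext()) {
        byte[] image = images.next();
        try (FSDataOutputStream out = fs.create(new Path("/path/to/this/image"))) {
            out.write(image);
        }
    }
});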

Related

Apache FOP and Java Image Issues - Combining multiple sources

I am trying to "automate" the building of a PDF using Apache FOP and Java. I want to minimize the hard coding since I don't know in advance all the file combinations I am going to need to support. In addition I want to try and not save files on the hard drive. Files on the HD introduces security, performance, threading and cleanup considerations I would rather not handle.
The test case I am using right now has 1 FO and 2 PNG files. One of the PNG files is over 1MB.
Ideally I would create 3 sources:
InputStream fo = new FileInputStream(new File("C:\\Temp\\FOP\\Test\\blah.fo"));
InputStream png1 = new FileInputStream(new File("C:\\Temp\\FOP\\Test\\image-1.png"));
InputStream png2 = new FileInputStream(new File("C:\\Temp\\FOP\\Test\\image-2.png"));
Source foSrc = new StreamSource(fo);
Source png1Src = new StreamSource(png1);
Source png2Src = new StreamSource(png2);
and then combine them all together to generate the PDF. I can't find a way using the API to do that.
The FO file refers to the images via:
<fo:external-graphic src="file:image-1.png"/>
<fo:external-graphic src="file:image-2.png"/>
When I use the command line FOP tools, it builds the PDF as I would expect. As long as the two images are in the same directory as the FO file, then all is good. Using the command line, there is no need to point out the existence or location of the images.
When using Java, I have tried a number of configurations, but none of them fit my need:
I saved the FO file and the 2 images into the same directory and referred to them using the following FopFactory constructor:
private static final FopFactory fopFactory = FopFactory.newInstance(new File("C:\\Temp\\FOP\\test").toURI());
This code base only finds the smaller of the two images. It seems like the larger one is being ignored since it is bigger than some limit.
I have tried the above constructor using various relative and absolute paths.
I have tried constructing FopFactory using the default "fop.xconf" file and adding the "C:\Temp\FOP\Test" directory to the classpath.
I have "hardcoded" the files and their locations in the FO file.
I have tried using the intermediate file structure (IFDocumentHandler, IFSerializer and IFConcatenator) for the images and get errors that way. It seems the intermediate files are not intended for images.
I have been able to embed the file into the FO file using base64 encoding and the syntax:
<fo:external-graphic src="url('data:image/png;base64,iVBORw...ggg==')"/>
The last one seems like the best solution other than taking 3 sources and using all 3 to generate the PDF. Any suggestions on how to use the API to combine the 3 sources?
Thanks.
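For reference, the base64 data URI used in that last attempt can be generated with nothing but the JDK; a minimal sketch (the buildPngDataUri helper is hypothetical, the path is the one from the question):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

// Hypothetical helper: turn PNG bytes into a value for fo:external-graphic's src attribute.
// The bytes could just as well come from memory instead of a file on disk.
static String buildPngDataUri(String path) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(path));
    return "url('data:image/png;base64," + Base64.getEncoder().encodeToString(bytes) + "')";
}

// Usage: String src = buildPngDataUri("C:\\Temp\\FOP\\Test\\image-1.png");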

Saving existent CSV file in Android with Processing

I am new to Android programming and I am struggling with saving an existing CSV file. I wrote the code in Java in Processing, and it works on the PC, but now I would like to switch to Android mode. How can I move the CSV file to my phone, and is there an easy way so I can use the command:
table = loadTable("Vocstest.csv", "header");
I use Processing 3.37
The code you wrote will work just fine on an Android phone. I have used the same code as yours in an app.
The difference may be that I do not try to overwrite it (by saving); I am only accessing it to retrieve the data.
You have to add your file to the "data" directory in your project. If the "data" folder does not exist you can create it and put your CSV file in it.
example:
Table aQaK = loadTable("aQaK_ar.csv", "header");
TableRow myrow = aQaK.getRow(myversenum);
String myversetxt = myrow.getString("AyahText");
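If you also need to write changes back, Processing provides saveTable(); a minimal sketch, assuming the file from the question and a hypothetical "word" column (note that in Android mode the sketch writes to the app's own storage rather than the original data folder):
Table table = loadTable("Vocstest.csv", "header");
TableRow row = table.getRow(0);
row.setString("word", "hello");          // "word" is a hypothetical column name
saveTable(table, "data/Vocstest.csv");   // on Android this ends up in the app's storage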
Hope this helps. Peace.

How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:
import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close()
}

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}
I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, as I would like the models to be saved to Amazon S3 eventually, but they both fail with messages indicating that the path cannot be found.
How to save models to Amazon S3?
One way to save a model to HDFS is as following:
// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")
Saved model can then be loaded as:
val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()
For more details see (ref)
Since Apache Spark 1.6, and in the Scala API, you can save your models without using any tricks, because all models from the ML library come with a save method; you can check LogisticRegressionModel, and indeed it has that method. To load the model back you can use the static load method:
val logRegModel = LogisticRegressionModel.load("myModel.model")
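The same methods are callable from Java as well; a minimal sketch of the save/load pair, assuming an already fitted org.apache.spark.ml LogisticRegressionModel and a placeholder HDFS path:
import java.io.IOException;
import org.apache.spark.ml.classification.LogisticRegressionModel;

void persist(LogisticRegressionModel model) throws IOException {
    // save() writes to whatever filesystem the path points at (HDFS, S3, local)
    model.save("hdfs:///user/root/myModel.model");
    // load it back later with the static load method
    LogisticRegressionModel restored = LogisticRegressionModel.load("hdfs:///user/root/myModel.model");
}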
FileOutputStream saves to the local filesystem (not through the Hadoop libraries), so saving to a local directory is the way to go about doing this. That being said, the directory needs to exist, so make sure it exists first.
Depending on your model, you may also wish to look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html (PMML export).

Is it possible to read a shapefile using geotools WITHOUT specifying the url of the file?

I am creating a web application which will allow the upload of shape files for use later on in the program. I want to be able to read an uploaded shapefile into memory and extract some information from it without doing any explicit writing to the disk. The framework I am using (play-framework) automatically writes a temporary file to the disk when a file is uploaded, but it nicely handles the creation and deletion of said file for me. This file does not have any extension, however, so the traditional means of reading a shapefile via Geotools, like this
public void readInShpAndDoStuff(File the_upload) throws IOException {
    Map<String, Serializable> map = new HashMap<>();
    map.put("url", the_upload.toURI().toURL());
    DataStore dataStore = DataStoreFinder.getDataStore(map);
}
fails with an exception which states
NAME_OF_TMP_FILE_HERE is not one of the files types that is known to be associated with a shapefile
After looking at the source of Geotools I see that the file type is checked by looking at the file extension, and since this is a tmp file it has none. (Running file FILENAME shows that the OS recognizes this file as a shapefile).
So at long last my question is, is there a way to read in the shapefile without specifying the Url? Some function or constructor which takes a File object as the argument and doesn't rely on a path? Or is it too much trouble and I should just save a copy on the disk? The latter option is not preferable, since this will likely be operating on a VM server at some point and I don't want to deal with file system specific stuff.
Thanks in advance for any help!
I can't see how this is going to work for you: a shapefile (despite its name) is a group of 3 (or more) files which share a basename and have the extensions .shp, .dbf, .shx (and usually .prj, .sbn, .fix, .qix etc).
Is there some way to make Play write the extensions with the tempfile name?
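A workaround in line with this answer would be to copy the uploaded temp files into a scratch directory under proper shapefile names before handing them to GeoTools; a rough sketch, assuming the .shp, .shx and .dbf parts arrive as separate uploads (file names are placeholders):
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;
import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

// shpUpload, shxUpload and dbfUpload are the extension-less temp files from Play
DataStore openUploadedShapefile(File shpUpload, File shxUpload, File dbfUpload) throws IOException {
    Path dir = Files.createTempDirectory("shp-upload");
    Path shp = dir.resolve("upload.shp");
    Files.copy(shpUpload.toPath(), shp, StandardCopyOption.REPLACE_EXISTING);
    Files.copy(shxUpload.toPath(), dir.resolve("upload.shx"), StandardCopyOption.REPLACE_EXISTING);
    Files.copy(dbfUpload.toPath(), dir.resolve("upload.dbf"), StandardCopyOption.REPLACE_EXISTING);

    // Same lookup as in the question, now with a recognised .shp extension
    Map<String, Serializable> map = new HashMap<>();
    map.put("url", shp.toUri().toURL());
    return DataStoreFinder.getDataStore(map);
}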

InputStream from ZipResourceFile taking a lot of time to read data

I have successfully implemented Apk Expansion Files for my project.
Problem: In my .obb I have a folder which has 100 XML files in it. Now the problem is that I am using the below code to read the data directly from the .obb file without extracting it.
This is the code given in the official doc here, http://developer.android.com/google/play/expansion-files.html, under the topic "Reading from a ZIP file":
ZipResourceFile expansionFile = APKExpansionSupport.getAPKExpansionZipFile(MyActivity.this, 1, 0);
String pathToFileInsideZip = "main.1.com.my.expansionfiles.obb/data/" +filename;
InputStream fileStream = expansionFile.getInputStream(pathToFileInsideZip);
I have a for loop in which I call this code so that it reads all the XML files one by one and makes the data ready for me to display.
The above will read the data directly from the .obb file, but the problem is it is taking a lot of time to extract the data.
Why so? Am I making a mistake here?
I believe that the pathToFileInsideZip would not be
main.1.com.my.expansionfiles.obb/data/[files].
I think it's just
"data/[files]"
