How to save models from ML Pipeline to S3 or HDFS? - java

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:
import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close()
}
schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}
I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, as I would like the models to be saved to Amazon S3 eventually, but both fail with messages indicating that the path cannot be found.
How to save models to Amazon S3?

One way to save a model to HDFS is as follows:
// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")
The saved model can then be loaded with:
val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()
For more details see (ref)
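Since the question is tagged Java, here is a minimal sketch of the same object-file approach from the Java API. It assumes a JavaSparkContext named jsc and an RDD-based mllib LinearRegressionModel; the wrapper class name is just for illustration:

import java.util.Arrays;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LinearRegressionModel;

public class ModelPersistence {

    // persist the model to HDFS as a one-partition object file
    public static void save(JavaSparkContext jsc, LinearRegressionModel model, String path) {
        jsc.parallelize(Arrays.asList(model), 1).saveAsObjectFile(path);
    }

    // objectFile gives back a JavaRDD of Object, so cast the single element on load
    public static LinearRegressionModel load(JavaSparkContext jsc, String path) {
        return (LinearRegressionModel) jsc.objectFile(path).first();
    }
}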

Since Apache Spark 1.6, and in the Scala API, you can save your models without any tricks: all models from the ML library come with a save method. You can check this in LogisticRegressionModel; it does have that method. To load the model back, use the static load method:
val logRegModel = LogisticRegressionModel.load("myModel.model")
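As a Java illustration of the same built-in persistence, and tying back to the original S3 question: the sketch below assumes a fitted PipelineModel and an s3a:// path, which only works if the cluster has the S3A connector (hadoop-aws) and credentials configured; the bucket name is a placeholder:

import java.io.IOException;

import org.apache.spark.ml.PipelineModel;

public class PipelinePersistence {

    // write the fitted pipeline straight to S3 (or any Hadoop-supported filesystem)
    public static void save(PipelineModel model, String path) throws IOException {
        model.write().overwrite().save(path);  // e.g. "s3a://my-bucket/models/school-123"
    }

    // load it back with the static PipelineModel.load
    public static PipelineModel load(String path) {
        return PipelineModel.load(path);
    }
}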

FileOutputStream writes to the local filesystem (it does not go through the Hadoop libraries), so saving to a local directory is the way to go with that approach. That said, the directory needs to exist, so make sure it exists first.
Depending on your model, you may also wish to look at PMML export: https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
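For what it's worth, a hedged Java sketch of PMML export: it only applies to the RDD-based mllib models that implement PMMLExportable (e.g. KMeansModel, the linear models, SVMModel), not to ML Pipeline models, and the path is a placeholder:

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeansModel;

public class PmmlExport {

    // toPMML(SparkContext, path) writes the PMML XML through the Hadoop filesystem
    // layer, so hdfs:// or s3a:// destinations should both work
    public static void export(JavaSparkContext jsc, KMeansModel model, String path) {
        model.toPMML(jsc.sc(), path);
    }
}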

Related

How to move (or copy) blobs between containers in Azure Blob Storage using java

I have a Java application which I use to upload some artifacts to Azure Blob Storage. Now I am trying to move these files from one container to another. Below is a sample of what I do right now.
BlobServiceClient blobServiceClient = new BlobServiceClientBuilder().connectionString(CONNECTION_STRING).buildClient();
BlobContainerClient releaseContainer = blobServiceClient.getBlobContainerClient(RELEASE_CONTAINER);
BlobContainerClient backupContainer = blobServiceClient.getBlobContainerClient(BACKUP_CONTAINER);

for (BlobItem blobItem : releaseContainer.listBlobs()) {
    BlobClient destBlobClient = backupContainer.getBlobClient(blobItem.getName());
    BlobClient sourceBlobClient = releaseContainer.getBlobClient(blobItem.getName());
    destBlobClient.copyFromUrl(sourceBlobClient.getBlobUrl());
    sourceBlobClient.delete();
}
Is there a more straightforward way to do this?
Also, if this is the way to do it, how can I delete the old file? sourceBlobClient.delete() doesn't work at the moment.
There is a similar discussion on SO; could you look at the suggestion there and let me know the status?
Additional information: have you checked the AzCopy tool? It is one of the best tools for moving/copying data.
I see you have posted a similar query in the Q&A forum.
You can move files using the rename method from the azure-storage-file-datalake Java client.
DataLakeServiceClient storageClient = new DataLakeServiceClientBuilder().endpoint(endpoint).credential(credential).buildClient();
DataLakeFileSystemClient dataLakeFileSystemClient = storageClient.getFileSystemClient("storage");
DataLakeFileClient fileClient = dataLakeFileSystemClient.getFileClient("src/path");
fileClient.rename("storage", "dest/path");
method documentation here
Note that the azure-storage-file-datalake Java client gives you great options for working with your storage account as a filesystem, but be aware that there is no way to copy files with it.
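With the plain azure-storage-blob v12 client, though, a copy-then-delete does work as a move. Below is a minimal sketch, assuming the source blob URL is readable by the service performing the copy (otherwise a SAS token is needed); releaseContainer and backupContainer are the container clients from the question:

import java.time.Duration;

import com.azure.core.util.polling.SyncPoller;
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.models.BlobCopyInfo;
import com.azure.storage.blob.models.BlobItem;

for (BlobItem blobItem : releaseContainer.listBlobs()) {
    BlobClient sourceBlobClient = releaseContainer.getBlobClient(blobItem.getName());
    BlobClient destBlobClient = backupContainer.getBlobClient(blobItem.getName());

    // start a server-side copy and block until it has finished
    SyncPoller<BlobCopyInfo, Void> poller =
            destBlobClient.beginCopy(sourceBlobClient.getBlobUrl(), Duration.ofSeconds(1));
    poller.waitForCompletion();

    // delete the source only once the copy has completed
    sourceBlobClient.delete();
}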

Most practical way to read an Azure Blob (PDF) in the Cloud?

I'm somewhat of a beginner and have never dealt with cloud-based solutions before.
My program uses the PDFBox library to extract data from PDFs and rename the files based on that data. It's all local currently, but it will eventually need to be deployed as an Azure Function. The PDFs will be stored in an Azure Blob container; the Azure Blob Storage trigger for Azure Functions is an important reason for this choice.
Of course I can download the blob locally and read it, but the program should run solely in the cloud. I've tried reading the blobs directly in Java, but this resulted in gibberish data and wasn't compatible with PDFBox. My plan for now is to temporarily store the files elsewhere in the cloud (e.g. OneDrive, Azure File Storage) and try opening them from there. However, this seems like it could quickly turn into an overly messy solution. My questions:
(1) Is there any way a blob can be opened as a File, rather than a CloudBlockBlob, so this additional step isn't needed?
(2) If not, what would be a recommended temporary storage option in this case?
(3) Are there any alternative ways to approach this issue?
Since you are planning an Azure Function, you can use the blob trigger/binding to get the bytes directly, and then build the document with PDFBox's PDDocument.load(content). You won't need any temporary storage for the file.
#FunctionName("blobprocessor")
public void run(
#BlobTrigger(name = "file",
dataType = "binary",
path = "myblob/{name}",
connection = "MyStorageAccountAppSetting") byte[] content,
#BindingName("name") String filename,
final ExecutionContext context
) {
context.getLogger().info("Name: " + filename + " Size: " + content.length + " bytes");
PDDocument doc = PDDocument.load(content);
// do your stuffs
}
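For the "// do your stuff" part, here is a small sketch of pulling the text out of the loaded document, assuming PDFBox 2.x (where PDFTextStripper lives in org.apache.pdfbox.text); the renaming logic from the question is left out:

import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfText {

    // load the blob bytes and return the plain text of the PDF
    public static String extractText(byte[] content) throws IOException {
        try (PDDocument doc = PDDocument.load(content)) {
            return new PDFTextStripper().getText(doc);
        }
    }
}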

Save image file to HDFS using Spark

I have an image file
image = JavaSparkContext.binaryFiles("/path/to/image.jpg");
I would like to process it and then save the binary info to HDFS using Spark. Something like:
image.saveAsBinaryFile("hdfs://cluster:port/path/to/image.jpg")
Is this possible? I'm not saying 'as simple', just possible. If so, how would you do it? I'm trying to keep things one-to-one if possible, i.e. keep the extension and type, so that if I download the file directly using the hdfs command line it is still a viable image file.
Yes, it is possible. But you need a data serialization plugin, for example Avro (https://github.com/databricks/spark-avro).
Assume an image is represented as binary (byte[]) in your program, so the images can be a Dataset<byte[]>.
You can save it using
datasetOfImages.write()
    .format("com.databricks.spark.avro")
    .save("hdfs://cluster:port/path/to/images.avro");
images.avro will be a folder containing multiple partitions, and each partition will be an Avro file holding some of the images.
Edit:
It is also possible, but not recommended, to save the images as separate files. You can call foreach on the dataset and use the HDFS API to save each image.
See below for a piece of code written in Scala; you should be able to translate it into Java (a rough Java sketch follows the Scala block).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

datasetOfImages.foreachPartition { images =>
  // build the FileSystem on the executor; referencing the SparkContext inside
  // the closure would not serialize
  val fs = FileSystem.get(new Configuration())
  images.foreach { image =>
    val out = fs.create(new Path("/path/to/this/image"))
    out.write(image)
    out.close()
  }
}
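And here is the rough Java translation mentioned above, using the Dataset API's foreachPartition. It assumes datasetOfImages is the Dataset<byte[]> from earlier, and the output path is just a placeholder that real code would derive per image:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.function.ForeachPartitionFunction;

datasetOfImages.foreachPartition((ForeachPartitionFunction<byte[]>) images -> {
    // build the FileSystem on the executor from the Hadoop config on its classpath
    FileSystem fs = FileSystem.get(new Configuration());
    while (images.hasNext()) {
        byte[] image = images.next();
        // placeholder path: derive a unique name per image in real code
        try (FSDataOutputStream out = fs.create(new Path("/path/to/this/image"))) {
            out.write(image);
        }
    }
});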

Encog: save a trained network to a file

Is it possible to save a trained network to a file and then use it again later (i.e. load it from the file)? Can you give a simple example? Currently I have to run the training every time:
EncogUtility.trainConsole(network, trainingSet, TRAINING_MINUTES)
You can use something like this to save the trained network file (this is C#, Java may have a different class for FileInfo):
FileInfo networkFile = new FileInfo(@"C:\Data\network.eg");
Encog.Persist.EncogDirectoryPersistence.SaveObject(networkFile, (BasicNetwork)network);
You can then use something like this to reload the network file:
network = (BasicNetwork)(Encog.Persist.EncogDirectoryPersistence.LoadObject(networkFile));
Use this example to save/load your network:
import static org.encog.persist.EncogDirectoryPersistence.*;
String filename = "C:/tmp/network.eg";
// save network...
saveObject(new File(filename), network);
// load network...
BasicNetwork loadFromFileNetwork = (BasicNetwork) loadObject(new File(filename));
Source: https://github.com/encog

Is it possible to read a shapefile using geotools WITHOUT specifying the url of the file?

I am creating a web application which will allow the upload of shapefiles for use later on in the program. I want to be able to read an uploaded shapefile into memory and extract some information from it without doing any explicit writing to the disk. The framework I am using (Play Framework) automatically writes a temporary file to disk when a file is uploaded, but it nicely handles the creation and deletion of that file for me. This file does not have an extension, however, so the traditional means of reading a shapefile via GeoTools, like this
public void readInShpAndDoStuff(File the_upload) throws IOException {
    Map<String, Serializable> map = new HashMap<>();
    map.put("url", the_upload.toURI().toURL());
    DataStore dataStore = DataStoreFinder.getDataStore(map);
}
fails with an exception which states
NAME_OF_TMP_FILE_HERE is not one of the files types that is known to be associated with a shapefile
After looking at the source of GeoTools I see that the file type is checked by looking at the file extension, and since this is a temp file it has none. (Running file FILENAME shows that the OS recognizes this file as a shapefile.)
So at long last my question is, is there a way to read in the shapefile without specifying the URL? Some function or constructor which takes a File object as the argument and doesn't rely on a path? Or is it too much trouble, and I should just save a copy on the disk? The latter option is not preferable, since this will likely be running on a VM server at some point and I don't want to deal with filesystem-specific stuff.
Thanks in advance for any help!
I can't see how this is going to work for you: a shapefile (despite its name) is a group of three (or more) files which share a basename and have the extensions .shp, .dbf and .shx (and usually .prj, .sbn, .fix, .qix, etc.).
Is there some way to make Play write the extensions along with the temp file names?
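If the sidecar files do arrive with the upload, one heavily hedged workaround (the helper and file names here are made up) is to reassemble them under a shared basename in a temporary directory and point GeoTools at the .shp:

import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

public class ShapefileUpload {

    // hypothetical helper: assumes the .shp, .shx and .dbf parts were uploaded separately
    public static DataStore open(File shpPart, File shxPart, File dbfPart) throws IOException {
        Path dir = Files.createTempDirectory("shp-upload");
        Files.copy(shpPart.toPath(), dir.resolve("upload.shp"), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(shxPart.toPath(), dir.resolve("upload.shx"), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(dbfPart.toPath(), dir.resolve("upload.dbf"), StandardCopyOption.REPLACE_EXISTING);

        Map<String, Serializable> params = new HashMap<>();
        params.put("url", dir.resolve("upload.shp").toUri().toURL());
        return DataStoreFinder.getDataStore(params);
    }
}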
