How to perform Update operations in GridFS (using Java)? - java

I am using Mongo-Java-Driver 2.13. I stored a PDF file (size 30 MB) in GridFS, and I am able to perform insert, delete, and find operations easily.
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("testDB");
File pdfFile = new File("/home/dev/abc.pdf");
GridFS gfs = new GridFS(db,"books");
GridFSInputFile inputFile = gfs.createFile(pdfFile);
inputFile.setId("101");
inputFile.put("title", "abc");
inputFile.put("author", "xyz");
inputFile.save();
Data is persisted in the books.files and books.chunks collections. Now I want to update:
case 1: the PDF file itself
case 2: the title or author
How can I perform the update for case 1 in GridFS?
I came to know that I need to maintain multiple versions of my files and pick the right version. Can anybody clarify how that works?
Edit:
I can update the metadata (title, author) easily:
GridFSDBFile outputFile = gfs.findOne(new BasicDBObject("_id", "101")); // _id was stored as the String "101"
BasicDBObject updatedMetadata = new BasicDBObject();
updatedMetadata.put("name", "PG");
updatedMetadata.put("age", 22);
outputFile.setMetaData(updatedMetadata);
outputFile.save();

In GridFS you are not removing/deleting a single document but actually a bunch of documents (files are split into chunks and each chunk is a separate document). That means replacing a file is simply not possible in an atomic manner.
What you can do instead is:
insert a new file with a new name (or a new _id)
once that has happened (use the replica-acknowledged write concern), update all references to the old file so they point to the new one
after you have confirmation of that, delete the old file (see the sketch after this list)
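A minimal sketch of that workflow with the 2.x driver used in the question; the new _id "102", the new file path, and the reference-update step are placeholders for your own schema:
MongoClient mongo = new MongoClient("localhost", 27017);
DB db = mongo.getDB("testDB");
GridFS gfs = new GridFS(db, "books");
// 1. insert the new version of the PDF under a new _id
GridFSInputFile newVersion = gfs.createFile(new File("/home/dev/abc_v2.pdf"));
newVersion.setId("102");
newVersion.put("title", "abc");
newVersion.put("author", "xyz");
newVersion.save();
// 2. update every document that references the old _id "101" so it points to "102"
//    (application-specific, omitted here)
// 3. only once step 2 is confirmed, remove the old file and all of its chunks
gfs.remove(new BasicDBObject("_id", "101"));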
GridFS is kind of a hackish feature. It is often better to just use a separate fileserver with a real filesystem to store the file content and only store the metadata in MongoDB.

Related

Write stream into mongoDB in Java

I have a file to store in MongoDB. I want to avoid loading the whole file into memory (it could be several MB in size); instead I want to open a stream and direct it to MongoDB to keep the write operation performant. I don't mind storing the content as a base64-encoded byte[].
Afterwards I want to do the same when reading the file, i.e. read it as a stream rather than loading it all into memory.
I am currently using hibernate-ogm with a Vert.x server, but I am open to switching to a different API if it serves the cause more efficiently.
I want to actually store a document with several fields and several attachments.
You can use GridFS. It is the recommended method especially when you need to store files larger than 16 MB (the BSON document size limit):
File f = new File("sample.zip");
GridFS gfs = new GridFS(db, "zips");
GridFSInputFile gfsFile = gfs.createFile(f);
gfsFile.setFilename(f.getName());
gfsFile.setId(id);
gfsFile.save();
Or in case you have an InputStream in:
GridFS gfs = new GridFS(db, "zips");
GridFSInputFile gfsFile = gfs.createFile(in);
gfsFile.setFilename("sample.zip");
gfsFile.setId(id);
gfsFile.save();
You can load a file using one of the GridFS.find methods:
GridFSDBFile gfsFile = gfs.findOne(id);
InputStream in = gfsFile.getInputStream();
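If you just want to stream the stored file to another sink without buffering it yourself, GridFSDBFile also provides writeTo overloads; a minimal sketch (the output path is illustrative):
GridFSDBFile gfsFile = gfs.findOne(id);
// streams the chunks straight to the output instead of loading the whole file into memory
gfsFile.writeTo(new FileOutputStream("/tmp/sample.zip"));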

How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:
import java.io._
def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close()
}
schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}
I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, as I would like the models to eventually be saved to Amazon S3, but they both fail with messages indicating the path cannot be found.
How can I save the models to Amazon S3?
One way to save a model to HDFS is as following:
// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")
The saved model can then be loaded as:
val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()
For more details see (ref)
Since Apache Spark 1.6, and in the Scala API, you can save your models without any tricks: all models in the ML library come with a save method (you can check this on LogisticRegressionModel, which indeed has it). To load the model back you can use a static load method:
val logRegModel = LogisticRegressionModel.load("myModel.model")
FileOutputStream writes to the local filesystem (it does not go through the Hadoop libraries), so saving to a local directory is the way to go about doing this. That said, the directory needs to exist, so make sure you create it first.
Depending on your model, you may also wish to look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html (PMML export).

How can I process a large file via CSVParser?

I have a large .csv file (about 300 MB), which is read from a remote host, and parsed into a target file, but I don't need to copy all the lines to the target file. While copying, I need to read each line from the source and if it passes some predicate, add the line to the target file.
I suppose that Apache Commons CSV (apache.commons.csv) can only parse the whole file:
CSVFormat csvFileFormat = CSVFormat.EXCEL.withHeader();
CSVParser csvFileParser = new CSVParser(new FileReader("filePath"), csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();
so I can't use a BufferedReader. Based on my code, a new CSVParser() instance would have to be created for each line, which looks inefficient.
How can I parse a single line (with a known table header) in the case above?
No matter what you do, all of the data from your file is going to come over to your local machine, because your system needs to parse through it to determine validity. Whether the file arrives via a reader feeding the parser (so you can parse each line) or you just copy the entire file over for parsing purposes, it will all end up local. You will need to get the data local, then trim the excess.
Calling csvFileParser.getRecords() is already a lost battle, because the documentation explains that this method loads every row of your file into memory. To parse the records while conserving memory, you should instead iterate over each record; the documentation implies the following code loads one record into memory at a time:
CSVParser csvFileParser = CSVParser.parse(new File("filePath"), StandardCharsets.UTF_8, csvFileFormat);
for (CSVRecord csvRecord : csvFileParser) {
    // qualify the csvRecord; output qualified rows to the new file and flush as needed
}
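To make that comment concrete, here is a minimal sketch of the qualify-and-write step using commons-csv's CSVPrinter; the predicate isValid and the file names are illustrative, not part of the original question:
CSVFormat format = CSVFormat.EXCEL.withHeader();
try (CSVParser parser = CSVParser.parse(new File("source.csv"), StandardCharsets.UTF_8, format);
     CSVPrinter printer = new CSVPrinter(new FileWriter("target.csv"), CSVFormat.EXCEL)) {
    for (CSVRecord record : parser) {      // one record in memory at a time
        if (isValid(record)) {             // isValid is your own predicate (hypothetical)
            printer.printRecord(record);   // copy only the qualifying rows
        }
    }
    printer.flush();
}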
Since you explained that "filePath" is not local, the above solution is prone to failure due to connectivity issues. To eliminate connectivity issues, I recommend you copy the entire remote file over to local, ensure the file copied accurately by comparing checksums, parse the local copy to create your target file, then delete the local copy after completion.
This is a late response, but you CAN use a BufferedReader with the CSVParser:
try (BufferedReader reader = new BufferedReader(new FileReader(fileName), 1048576 * 10)) {
    Iterable<CSVRecord> records = CSVFormat.RFC4180.parse(reader);
    for (CSVRecord line : records) {
        // process each line here
    }
} catch (IOException e) {
    // handle exceptions from your BufferedReader/parser here
}

SPARQL query doesn't update when inserting data through Java code

I'm trying to insert data through my Java code into the OWL file which is loaded into a Fuseki server. The update query doesn't give any error message, but the OWL file doesn't update. I'm using the Jena library and have implemented this in Java. What is wrong in my code?
public boolean addLecturerTriples(String fName, String lName,
                                  String id, String module) {
    try {
        ArrayList<String> subject = new ArrayList<String>();
        ArrayList<String> predicate = new ArrayList<String>();
        ArrayList<String> object = new ArrayList<String>();
        subject.add("<http://people.brunel.ac.uk/~csstnns/university.owl#" + fName + ">");
        predicate.add("<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>");
        object.add("<http://people.brunel.ac.uk/~csstnns/university.owl#Lecturer>");
        for (int i = 0; i < subject.size(); i++) {
            String qry = "INSERT DATA" +
                    "{" +
                    subject.get(i) + "\n" +
                    predicate.get(i) + "\n" +
                    object.get(i) + "\n" +
                    "}";
            UpdateRequest update = UpdateFactory.create(qry);
            UpdateProcessor qexec = UpdateExecutionFactory.createRemote(update, "http://localhost:3030/ds/update");
            qexec.execute();
        }
    } catch (Exception e) {
        return false;
    }
    return true;
}
It would help if you had provided a minimal complete example, i.e. if you had included your Fuseki configuration and the details of how your OWL file is loaded into Fuseki.
However, I will assume you have not used any specific configuration and are just launching Fuseki like so:
java -jar fuseki-server-VER.jar --update --loc /path/to/db /ds
What you've done here is launch Fuseki with updates enabled, using /path/to/db as the on-disk TDB database location and the URL /ds for your dataset.
Then you open your browser, click through Control Panel > /ds, and use the Upload file function to upload your OWL file. When you upload a file it is read into Fuseki and copied into the dataset; in this example your dataset is the on-disk TDB database located at /path/to/db.
It is important to understand that no reference to the original file is kept since Fuseki has simply copied the data from the file to the dataset.
You then use the SPARQL Update form to add some data (or, in your case, you do this via Java code). The update is applied to the dataset, which, to reiterate, is in this example the on-disk TDB database located at /path/to/db and has no reference to the original file. Therefore your original file will not change.
Using SPARQL Update to update the original file
If Fuseki is not essential then you could just load your file into local memory and run the update there instead:
Model m = ModelFactory.createDefaultModel();
m.read("example.owl", "RDF/XML");
// Prepare your UpdateRequest (named 'update' below)...
// Create an UpdateProcessor over the local model
UpdateProcessor processor = UpdateExecutionFactory.create(update, GraphStoreFactory.create(m));
processor.execute();
// Save the updated model back over the original file
m.write(new FileOutputStream("example.owl"), "RDF/XML");
However, if you want to (or must) stick with Fuseki, you can update your original file by retrieving the modified graph from Fuseki and writing it back out to your file, e.g.
DatasetAccessor accessor = DatasetAccessorFactory.createHTTP("http://localhost:3030/ds/data");
// Download the updated model
Model updated = accessor.getModel();
// Save the updated model over the original file
updated.write(new FileOutputStream("example.owl"), "RDF/XML");
This example assumes that you have loaded the OWL file into the default graph; if not, use the getModel("http://graph") overload to fetch the relevant named graph.
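Conversely (a sketch that goes beyond the original answer, using only standard Jena APIs): if you update the model locally, as in the first snippet, and also want Fuseki to see the change, the same DatasetAccessor can push the model back to the server:
DatasetAccessor accessor = DatasetAccessorFactory.createHTTP("http://localhost:3030/ds/data");
Model local = ModelFactory.createDefaultModel();
local.read("example.owl", "RDF/XML");
// ... apply your local updates to 'local' ...
accessor.putModel(local); // replaces the default graph on the server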

Overwriting SQLite database on a shared network folder

I have an SQLite3 database on a shared folder. I want to overwrite the db file from a Java application. Though there is low read and write traffic on this file, I want to ensure that a) the overwrite doesn't corrupt the db file, and b) anyone who might be looking to access the db file will essentially see it locked until the overwrite is complete. My current plan looks something like this...
String query = "BEGIN EXCLUSIVE TRANSACTION";
/* Execute this query*/
File sourceFile = new File(LocalPath);
File destFile = new File(DbPath);
InputStream inStream = new FileInputStream(sourceFile);
OutputStream outStream = new FileOutputStream(destFile);
byte[] buffer = new byte[1024];
int length;
while ((length = inStream.read(buffer)) > 0) {
    outStream.write(buffer, 0, length);
}
inStream.close();
outStream.close();
/* Now release lock */
query = "ROLLBACK TRANSACTION";
/* Execute query */
Reading the guidance from SQLite at http://www.sqlite.org/howtocorrupt.html, it would seem this lock would exist in the journal and be updated after the copy when I run the rollback transaction. In the meantime, if a client tries to access the DB while I'm copying, I imagine they would simply not find the file and their SQLite driver would assume the DB doesn't exist. Right?
My question is: is it a moot point to place a lock on the DB? Is there a better strategy, or a way to make the DB appear locked rather than missing? Also, am I running such a huge risk of DB corruption that this is unfeasible? One other thought I've had would be to lock the DB file, write the new file under a different name, rename it once writing is complete, and then release the lock... any thoughts on that?
I'm not sure overwriting the DB file is a brilliant idea, but it's the only feasible thing I can think of with the resources at my disposal (running huge transactions on a DB in a shared folder over the network is unacceptably slow, and I am aware that I run a higher risk of corruption working with an SQLite DB on a shared folder). I am writing updates locally and then letting the user elect to "commit" the changes and initiate the DB file copy.
In addition to answering my questions, any general advice on this case is welcome...
The Backup API allows you to overwrite a database.
(I don't know if your Java wrapper exposes this API.)
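If you happen to be using the Xerial sqlite-jdbc driver (an assumption, since the question doesn't name the wrapper), it exposes SQLite's online backup API through special "backup to" / "restore from" statements; a minimal sketch with illustrative paths:
try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/local/path/app.db");
     Statement stmt = conn.createStatement()) {
    // copy the live local database to the shared folder via the backup API,
    // which takes the required SQLite locks itself instead of relying on a raw file copy
    stmt.executeUpdate("backup to '/shared/folder/app.db'");
}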
