We have functionality that creates a number of folders when the user saves data in CM.
The structure looks like this:
ParentFolder
    ChildFolder1
    ChildFolder2
    ChildFolder3
    File1
    File2
    File3
ParentFolderConfig
    ChildFolderConfig1
    ChildFolderConfig2
    ChildFolderConfig3
    FileConfig1
    FileConfig2
    FileConfig3
All of these are created every time the user saves. I have found a way to add nodes one by one using addNode(), but to save time and improve performance I wanted to find a way to build these files and folders temporarily in Java, save them to the JCR in one call, and dispose of the temporary structure afterwards.
Calling addNode() multiple times and saving at the end with Session.save() is a common pattern in JCR; it's perfectly fine to create your structure like that.
To make your code simpler you could use a utility class that takes the path of a node deep in your hierarchy and creates the intermediate nodes as needed. The JcrUtils.getOrCreateByPath method provided by the Jackrabbit commons module does exactly that.
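As a rough sketch (the /content root path and the nt:folder node type are assumptions; adjust them to your content model), the whole hierarchy can be built in memory and persisted with a single save:

import javax.jcr.RepositoryException;
import javax.jcr.Session;
import org.apache.jackrabbit.commons.JcrUtils;

public class FolderStructureCreator {

    // Creates the parent/child folder hierarchy and persists it in one call.
    public void createStructure(Session session) throws RepositoryException {
        String[] paths = {
            "/content/ParentFolder/ChildFolder1",
            "/content/ParentFolder/ChildFolder2",
            "/content/ParentFolder/ChildFolder3",
            "/content/ParentFolderConfig/ChildFolderConfig1",
            "/content/ParentFolderConfig/ChildFolderConfig2",
            "/content/ParentFolderConfig/ChildFolderConfig3"
        };
        for (String path : paths) {
            // Missing intermediate nodes are created automatically as nt:folder
            JcrUtils.getOrCreateByPath(path, "nt:folder", session);
        }
        // One round trip persists everything that was added above
        session.save();
    }
}

The files can still be added under the folders with addNode()/setProperty() before the final session.save(); the point is simply to defer the save to one call at the end.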
I have a folder with 1.5 million objects (about 5 TB of data) containing folders named in the format 123-John.
I need to copy the contents of all these folders into new folders renamed to the format 123.
I want to do this in Java.
Obviously I can't just do it one by one like this:
ObjectListing objectListing = s3.listObjects(listObjectsRequest);
boolean processable = true;
while (processable) {
    processable = objectListing.isTruncated();
    renameAndCopyOneByOne(objectListing.getObjectSummaries()); // edits the name and calls s3.copyObject()
    if (processable) {
        objectListing = s3.listNextBatchOfObjects(objectListing);
    }
}
It would lead to making about 1.5 million calls to
s3.copyObject(bucket, sourceKey, bucket, destinationKey)
I wanted to do it with a batch operation, but the catch is that it can only be driven by a manifest file in CSV format, like
bucketName,keyName
That is just a manifest of the objects I want to act on; I can't list destinations or specify the edited folder name. I would also still have to split a CSV of 1.5 million entries into smaller ones and create several requests to S3, resulting in several jobs that would not be easy to track.
Could you please give me a hint as to which AWS tool would suit this task?
Well, after some time spent figuring out how to do this properly, I think the only way is to run such a migration as a batch job from Java and split the load yourself, because AWS does not have a suitable tool for this case.
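A minimal sketch of such a job, assuming the AWS SDK for Java v1 (the same API used in the snippet above), a single bucket, and a key-rewriting rule that turns 123-John/... into 123/...; the pool size is an arbitrary choice:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3RenameCopyJob {

    // Lists every object and runs the per-object copy calls on a thread pool
    // so the ~1.5 million copyObject requests are made in parallel.
    public void run(AmazonS3 s3, String bucket) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        ObjectListing listing = s3.listObjects(bucket);
        while (true) {
            for (S3ObjectSummary summary : listing.getObjectSummaries()) {
                String sourceKey = summary.getKey();
                // e.g. "123-John/photo.jpg" -> "123/photo.jpg"
                String destinationKey = sourceKey.replaceFirst("^(\\d+)-[^/]+/", "$1/");
                pool.submit(() -> s3.copyObject(bucket, sourceKey, bucket, destinationKey));
            }
            if (!listing.isTruncated()) {
                break;
            }
            listing = s3.listNextBatchOfObjects(listing);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}

Objects over 5 GB need a multipart copy instead of a single copyObject call, and failed submissions should be retried; both are left out of this sketch.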
In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources, next to the src folder.
The src folder contains the main class, and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check whether the file was found using:
LOG.info(temp1.exists())
which works fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class I made called SensorValue:
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings into the SensorValue class using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice is for using static resource files to enrich a pipeline.
You're right that the problem is that execution of the ParDo is distributed across multiple workers. They don't have the local file, and they may not have the contents of the map either.
There are a few options here:
1. Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side input to your later processing.
2. Include the file in the resources of the pipeline and load it in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
3. Serialize the contents of the map into the arguments of the DoFn, by putting it in a non-static field passed to the constructor of that class.
Option 1 is better as the size of the file increases (since it can support splitting it into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it significantly increases the size of the serialized DoFn, which may make the job too large to submit to the Dataflow service.
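A minimal sketch of Option 3, using the SensorValue type from the question and the Dataflow SDK 1.x DoFn API; the parsing done in the SensorValue constructor is assumed to stay as it is:

import java.util.ArrayList;
import java.util.Map;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

// The map travels with the DoFn: it is serialized once at submission time
// and deserialized on every worker, so no static state is involved.
public class ParseSensorValueFn extends DoFn<String, SensorValue> {

    private final Map<String, ArrayList<Double>> sensorToCoordinates;

    public ParseSensorValueFn(Map<String, ArrayList<Double>> sensorToCoordinates) {
        // Must be a serializable Map implementation, e.g. HashMap
        this.sensorToCoordinates = sensorToCoordinates;
    }

    @Override
    public void processElement(ProcessContext c) {
        // Assumed SensorValue constructor taking the raw message and the map
        c.output(new SensorValue(c.element(), sensorToCoordinates));
    }
}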
I want to store my blobs outside of the database in files; however, they are just random blobs of data and aren't directly linked to a file.
So for example I have a table called Data with the following columns:
id
name
comments
...
I can't just include a column called fileLink or something like that because the blob is just raw data. I do, however, want to store it outside of the database. I would love to create a file called 3.dat, where 3 is the id of that row. The only problem with this setup is that the main folder would quickly accumulate a huge number of files, since the ids form a flat structure, and that will run into OS file-system issues. And no, the data is not grouped or structured; it's one massive list.
Is there a Java framework or library that will allow me to store and manage the blobs so that I can just do something like MyBlobAPI.saveBlob(id, data); and then MyBlobAPI.getBlob(id), and so on? In other words, something where all the file I/O is handled for me?
Simply use an appropriate database which implements blobs as you described, and use JDBC. You really are not looking for another API but for a specific implementation. It's up to the DB to take care of storing the blobs efficiently.
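For illustration, a minimal sketch of what that looks like through plain JDBC; the table name Data and the blob_data column are assumptions based on the question:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcBlobStore {

    // Writes the raw bytes into a BLOB column keyed by the row id.
    public void saveBlob(Connection conn, long id, byte[] data) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("UPDATE Data SET blob_data = ? WHERE id = ?")) {
            ps.setBytes(1, data);
            ps.setLong(2, id);
            ps.executeUpdate();
        }
    }

    // Reads the bytes back, or returns null if the row does not exist.
    public byte[] getBlob(Connection conn, long id) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT blob_data FROM Data WHERE id = ?")) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getBytes(1) : null;
            }
        }
    }
}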
I think a home-rolled solution would include something like a fileLink column in your table, and your API would create the file on the first save and then overwrite that file on update.
I don't know of any code base that will do this for you. There are a bunch that provide an in-memory file system for Java, but it's only a few lines of code to write something that reads and writes Java objects to and from files.
You'll have to handle any file-system limitations yourself, though I doubt you'll ever hit the limits of modern file systems like Btrfs or ZFS. FAT32 is limited to roughly 65K files per directory, but even last-generation file systems support something on the order of 4 billion files per directory.
So by all means, write a class with two functions: one to serialize an object to a file, giving it a unique key as a name, and another to deserialize the object by that key. If you are using a modern file system, you'll never run out of resources.
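A minimal sketch of that two-method class, assuming raw byte[] blobs stored as <id>.dat in a single storage directory:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileBlobStore {

    private final Path dir;

    public FileBlobStore(Path dir) throws IOException {
        // Create the storage directory if it does not exist yet
        this.dir = Files.createDirectories(dir);
    }

    // Writes the blob to <id>.dat, overwriting any previous version.
    public void saveBlob(long id, byte[] data) throws IOException {
        Files.write(dir.resolve(id + ".dat"), data);
    }

    // Reads the blob back by its id.
    public byte[] getBlob(long id) throws IOException {
        return Files.readAllBytes(dir.resolve(id + ".dat"));
    }
}

If the flat directory ever becomes a concern, the store can shard the files into subdirectories derived from the id (for example id % 1000) without changing the API.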
As far as I can tell there is no framework for this. The closest I could find was Hadoop's HDFS.
That being said, the advice in the other answers of simply putting the BLOBs into the database is not always advisable. Sometimes it's good and sometimes it's not; it really depends on your situation. Here are a few links to such discussions:
Storing Images in DB - Yea or Nay?
https://softwareengineering.stackexchange.com/questions/150669/is-it-a-bad-practice-to-store-large-files-10-mb-in-a-database
I did find some additional really good links, but I can't remember them offhand. There was one in particular on StackOverflow, but I can't find it. If you believe you know the link, please add it in the comments so that I can confirm it's the right one.
Using Java, I am creating a program that indexes a folder structure and allows a user to search for files, tag a file with keywords, and then search for files based on those tags.
I have been traversing the folder hierarchy using the FileUtils listFiles method, which is essentially this question: Recursively list files in Java.
I haven't yet begun to code the tagging functionality, but thinking ahead, I fear that if a file is renamed or moved after I associate it with a tag, it will lose the tag. This defeats the purpose of my program, so can anybody offer suggestions on how to store each file located in the folder hierarchy, or how to associate the tag, so that if a file is renamed or moved it still has the tag associated with it?
If you want to keep track of a file, even when its name and/or location changes, you should use its unique identifier, which in most file systems is called its inode. (I think NTFS/Windows calls it a "file ID.") You can read a file's inode using its BasicFileAttributes.fileKey:
Object key = Files.getAttribute(file.toPath(), "fileKey");
That key is suitable for use as a HashMap key.
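For example, a small sketch of keeping tags keyed by the fileKey; note that fileKey can be null on file systems that don't expose one, and the key is only stable while the file stays on the same volume:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class FileTagIndex {

    // Tags keyed by the file's unique identifier rather than its path.
    private final Map<Object, Set<String>> tagsByFileKey = new HashMap<>();

    public void addTag(Path file, String tag) throws IOException {
        Object key = Files.getAttribute(file, "fileKey");
        tagsByFileKey.computeIfAbsent(key, k -> new HashSet<>()).add(tag);
    }

    // Still finds the tags after the file has been renamed or moved on the same volume.
    public Set<String> getTags(Path file) throws IOException {
        Object key = Files.getAttribute(file, "fileKey");
        return tagsByFileKey.getOrDefault(key, Collections.emptySet());
    }
}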
If the OS doesn't support file tagging, you could:
maintain a mapping of file path to tags
maintain a mapping of file hash to tags
Using option #2, your tags would be preserved even if a file was moved. But if someone moved and modified the file, the tags would be lost.
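A small sketch of option #2's key, hashing the file contents (SHA-256 is an arbitrary choice here) so the resulting string can be used as the map key for the tags:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class ContentHashKey {

    // Hashes the whole file; streaming would be preferable for very large files.
    public static String hashOf(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        // The returned string is the key in a Map<String, Set<String>> of tags
        return Base64.getEncoder().encodeToString(hash);
    }
}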
I don't think there's a way to do this without updating your tag relationship to the newly created file, since the rename/mv operation happens at the disk level and there is actually a delete-and-create compound action happening in the background. Because of that, there's no guarantee that a file will even be in the same place on the disk. If you know for sure that the file will have the same contents, you could take an MD5 signature of the file's contents as a String and then always compare those when a tag is queried; of course this has its downfalls too when the file's contents change.
Your best bet is to use a hash map with the files' paths and tags and then use a directory watcher to update the hash map when a file name changes. That's the best I can think of!
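A rough sketch of that idea with java.nio's WatchService; a rename usually arrives as an ENTRY_DELETE followed by an ENTRY_CREATE, and correlating the two events (for example by content hash) is deliberately left out:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TagWatcher {

    // Tags keyed by file path; updated as the watched directory changes.
    private final Map<Path, Set<String>> tagsByPath = new ConcurrentHashMap<>();

    public void watch(Path dir) throws IOException, InterruptedException {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take();  // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = dir.resolve((Path) event.context());
                if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                    // The old path is gone; keep its tags aside until the
                    // matching ENTRY_CREATE tells us the new name.
                    Set<String> orphanedTags = tagsByPath.remove(changed);
                }
            }
            if (!key.reset()) {
                break;  // directory is no longer accessible
            }
        }
    }
}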
I am making a Java program that has a collection of flash-card-like objects. I store the objects in a JTree composed of DefaultMutableTreeNodes. Each node has a user object attached to it with a few String/native data type parameters. However, I also want each of these objects to have an image (typical formats: jpg, png, etc.).
I would like to be able to store all of this information, including the images and the tree data to the disk in a single file so the file can be transferred between users and the entire tree, including the images and parameters for each object, can be reconstructed.
I had not approached a problem like this before, so I was not sure what the best practices were. I found XMLEncoder (http://java.sun.com/j2se/1.4.2/docs/api/java/beans/XMLEncoder.html) to be a very effective way of storing my tree and the native data type information. However, I couldn't figure out how to save the image data itself inside the XML file, and I'm not sure it is possible since the data is binary (so restricted characters would be invalid). My next thought was to associate a hash string instead of an image with each user object, and then gzip together all of the images, named by their hash strings, along with the XML-encoded tree in the same compressed file. That seemed really contrived though.
Does anyone know a good approach for this type of issue?
Thanks!
Assuming this isn't just a serializable graph, consider bundling the files together in Jar format. If you already have your data structures working with XMLEncoder, you can reuse this code by saving the data as a jar entry.
If memory serves, the jar library has better support for Unicode name entries than the zip package, which is why I would favour it.
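A minimal sketch of that approach: one jar with a tree.xml entry written by XMLEncoder plus one entry per image; the entry names and the idea of referencing images from the user objects by entry name are assumptions:

import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

public class FlashCardBundle {

    // Writes the XML-encoded tree and the raw image bytes into a single jar file.
    public void save(Path jarFile, Object treeModel, List<Path> images) throws IOException {
        try (JarOutputStream jar = new JarOutputStream(Files.newOutputStream(jarFile))) {
            // Entry 1: the tree and its plain parameters, via XMLEncoder
            jar.putNextEntry(new JarEntry("tree.xml"));
            ByteArrayOutputStream xml = new ByteArrayOutputStream();
            try (XMLEncoder encoder = new XMLEncoder(xml)) {
                encoder.writeObject(treeModel);
            }
            jar.write(xml.toByteArray());
            jar.closeEntry();

            // One entry per image, stored as-is
            for (Path image : images) {
                jar.putNextEntry(new JarEntry("images/" + image.getFileName()));
                jar.write(Files.readAllBytes(image));
                jar.closeEntry();
            }
        }
    }
}

Reading it back is the mirror image: open a JarFile, decode tree.xml with XMLDecoder, and load each images/ entry as needed.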
You might consider using an MS JET database (.mdb file) and storing all the stuff in there. That'll also make it easy to examine and edit the data in (for example) MS Access.
You can employ a virtual file system which stores its data in a single container. We develop and offer one such file system, SolFS, but right now there is no Java binding for it. We will release a Java JNI interface for SolFS within a month.