I just started with MinIO and Apache Beam. I have created a bucket on play.min.io and added a few files (let's say the stored files are one.txt and two.txt). I want to access the files stored in that bucket with the Apache Beam Java SDK. When I deal with local files I just pass the path of the file, like C://new//.., but I don't know how to get files from MinIO. Can anyone help me with the code?
I managed to get it working with some configuration on top of the standard AWS configuration:
The AwsServiceEndpoint should point to your MinIO server (here localhost:9000):
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
...
options.as(AwsOptions.class).setAwsServiceEndpoint("http://localhost:9000");
PathStyleAccess has to be enabled (so that bucket access does not translate to a request to "http://bucket.localhost:9000" but to "http://localhost:9000/bucket").
This can be done by extending DefaultS3ClientBuilderFactory with this kind of MinioS3ClientBuilderFactory:
public class MinioS3ClientBuilderFactory extends DefaultS3ClientBuilderFactory {

    @Override
    public AmazonS3ClientBuilder createBuilder(S3Options s3Options) {
        AmazonS3ClientBuilder builder = super.createBuilder(s3Options);
        builder.withPathStyleAccessEnabled(true);
        return builder;
    }
}
and inject it into the options like this:
Class<? extends S3ClientBuilderFactory> builderFactory = MinioS3ClientBuilderFactory.class;
options.as(S3Options.class).setS3ClientFactoryClass(builderFactory);
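With the endpoint, path-style access and builder factory configured as above, the bucket contents can then be read through Beam's s3:// scheme, just like local paths. A minimal sketch, assuming a bucket named mybucket (bucket name, region and credentials are placeholders for your MinIO setup) and the beam-sdks-java-io-amazon-web-services module on the classpath:

// Credentials for the MinIO server (placeholders), passed as static AWS credentials
options.as(AwsOptions.class).setAwsRegion("us-east-1");
options.as(AwsOptions.class).setAwsCredentialsProvider(
        new AWSStaticCredentialsProvider(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")));

Pipeline p = Pipeline.create(options);
// Matches one.txt and two.txt from the question
PCollection<String> lines = p.apply("ReadFromMinio", TextIO.read().from("s3://mybucket/*.txt"));
p.run().waitUntilFinish();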
I am writing a custom Java annotator for our UIMA pipeline in Watson Explorer Content Analytics.
There are two places (that I know of) where I can try to get the URL or filename of the document that is currently being processed.
Initialize
public class CustomAnnotator extends JCasAnnotator_ImplBase {

    @Override
    public void initialize(UimaContext aContext)
            throws ResourceInitializationException {
        super.initialize(aContext);
        .... HERE MAYBE ? ....
Or
Process
    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        try {
            .... HERE ....
I have tried several options:
via the context in the initialize method (running the pipeline on the server, I could get the PearID, for example),
via the Sofa in the process method (e.g. jcas.getSofa().getSofaURI())
I also found SourceDocumentInformation, but this is only an example, and although the method getUri() seems promising, I depend on IBM to implement the setUri(String) method...
But so far I have not been successful; I hope I have overlooked something...
I asked the same question on IBM dW Answers.
In short, you can access multiple views when the pipeline runs on the Watson Explorer Content Analytics server. For the metadata you need to inspect the _InitialView and not the rlw view, which is the one that holds all the annotations created by the custom pipeline you build in Content Analytics Studio.
More details can be found at the link below; also have a look at the responses!
https://www.ibm.com/developerworks/community/blogs/ibmandgoogle/entry/Exporting_annotations_from_Watson_Explorer_Content_Analytics?lang=en
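As a rough illustration of what that looks like inside the annotator (a sketch only; which metadata fields actually appear in the _InitialView depends on your crawler and collection configuration):

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        try {
            // Switch from the view handed to the annotator to the initial view,
            // which is where the document-level metadata lives.
            JCas initialView = jcas.getView("_InitialView");
            // Depending on the source, the URI may be available directly ...
            String uri = initialView.getSofaDataURI();
            // ... or you iterate the metadata annotations in the initial view
            // and pick out the field that carries the document URL / file name.
        } catch (CASException e) {
            throw new AnalysisEngineProcessException(e);
        }
    }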
I'm writing a Dataflow pipeline that should do 3 things:
Reading .csv files from GCP Storage
Parsing the data to BigQuery-compatible TableRows
Writing the data to a BigQuery table
Up until now this has all worked like a charm. And it still does, but when I change the source and destination variables, nothing changes. The job that actually runs is an old one, not the recently changed (and committed) code. Somehow, when I run the code from Eclipse using the BlockingDataflowPipelineRunner, the code itself is not uploaded; an older version is used instead.
Normally there is nothing wrong with the code, but to be as complete as possible:
public class BatchPipeline {

    public static void main(String[] args) {
        String source = "gs://sourcebucket/*.csv";
        String destination = "projectID:datasetID.testing1";

        // Creation of the pipeline with default arguments
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

        PCollection<String> line = p.apply(TextIO.Read.named("ReadFromCloudStorage")
                .from(source));

        @SuppressWarnings("serial")
        PCollection<TableRow> tablerows = line.apply(ParDo.named("ParsingCSVLines").of(new DoFn<String, TableRow>() {
            @Override
            public void processElement(ProcessContext c) {
                // processing code goes here
            }
        }));

        // Defining the BigQuery table schema
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("datetime").setType("TIMESTAMP").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("consumption").setType("FLOAT").setMode("REQUIRED"));
        fields.add(new TableFieldSchema().setName("meterID").setType("STRING").setMode("REQUIRED"));
        TableSchema schema = new TableSchema().setFields(fields);
        String table = destination;

        tablerows.apply(BigQueryIO.Write
                .named("BigQueryWrite")
                .to(table)
                .withSchema(schema)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withoutValidation());

        // Runs the pipeline
        p.run();
    }
}
This problem arose because I've just changed laptops and had to reconfigure everything. I'm working on a clean Ubuntu 16.04 LTS OS with all the dependencies for GCP development installed (normally). Everything seems to be configured well enough, since I'm able to start a job (which shouldn't be possible if my configuration were wrong, right?). I'm using Eclipse Neon, by the way.
So where could the problem lie? It seems to me that there is a problem uploading the code, but I've made sure that my cloud git repo is up-to-date and the staging bucket has been cleaned up ...
**** UPDATE ****
I never found out exactly what was going wrong, but when I checked the creation dates of the files in my deployed jar, I saw that they had indeed never been updated. The jar file itself, however, had a recent timestamp, which made me overlook that problem completely (rookie mistake).
I eventually got it all working again by simply creating a new Dataflow project in Eclipse and copying my .java files from the broken project into the new one. Everything worked like a charm from then on.
Once you submit a Dataflow job, you can check which artifacts were part of the job specification by inspecting the staged files in the job description, which are available via DataflowPipelineWorkerPoolOptions#getFilesToStage. The code snippet below gives a small sample of how to get this information.
PipelineOptions myOptions = ...
myOptions.setRunner(DataflowPipelineRunner.class);
Pipeline p = Pipeline.create(myOptions);
// Build up your pipeline and run it.
p.apply(...)
p.run();
// At this point in time, the files which were staged by the
// DataflowPipelineRunner will have been populated into the
// DataflowPipelineWorkerPoolOptions#getFilesToStage
List<String> stagedFiles = myOptions.as(DataflowPipelineWorkerPoolOptions.class).getFilesToStage();
for (String stagedFile : stagedFiles) {
System.out.println(stagedFile);
}
The above code should print out something like:
/my/path/to/file/dataflow.jar
/another/path/to/file/myapplication.jar
/a/path/to/file/alibrary.jar
It is likely that the resources that are part of the job you're uploading are out of date in some way and contain your old code. Look through all the directories and jars in the staging list, find all instances of BatchPipeline, and verify their age. Jar files can be extracted using the jar tool or any zip file reader. Alternatively, use javap or any other class file inspector to validate that the BatchPipeline class file lines up with the changes you expect to have made.
I have a Java application with lots of NIO calls like Files.copy, Files.move, Files.delete, FileChannel...
What I'm now trying to achieve: I want to access a remote WebDAV server and modify data on that server with basic operations like upload, delete or update of the remote WebDAV data, without changing every method in my application. So here comes my idea:
I think a WebDAV FileSystem implementation would do the trick: adding a custom WebDAV FileSystemProvider which manages the mentioned file operations on the remote data. I've googled a lot, and Apache VFS with the Sardine implementation looks good, BUT it seems that Apache VFS is not compatible with NIO?
Here's some example code, as I imagine it:
public class WebDAVManagerTest {

    private static DefaultFileSystemManager fsManager;
    private static WebdavFileObject testFile1;
    private static WebdavFileObject testFile2;
    private static FileSystem webDAVFileSystem1;
    private static FileSystem webDAVFileSystem2;

    @Before
    public static void initWebDAVFileSystem(String webDAVServerURL) throws FileSystemException, org.apache.commons.vfs2.FileSystemException {
        try {
            fsManager = new DefaultFileSystemManager();
            fsManager.addProvider("webdav", new WebdavFileProvider());
            fsManager.addProvider("file", new DefaultLocalFileProvider());
            fsManager.init();
        } catch (org.apache.commons.vfs2.FileSystemException e) {
            throw new FileSystemException("Exception initializing DefaultFileSystemManager: " + e.getMessage());
        }

        String exampleRemoteFile1 = "/foo/bar1.txt";
        String exampleRemoteFile2 = "/foo/bar2.txt";

        testFile1 = (WebdavFileObject) fsManager.resolveFile(webDAVServerURL + exampleRemoteFile1);
        webDAVFileSystem1 = (FileSystem) fsManager.createFileSystem(testFile1);
        Path localPath1 = webDAVFileSystem1.getPath(testFile1.toString());

        testFile2 = (WebdavFileObject) fsManager.resolveFile(webDAVServerURL + exampleRemoteFile2);
        webDAVFileSystem2 = (FileSystem) fsManager.createFileSystem(testFile2);
        Path localPath2 = webDAVFileSystem2.getPath(testFile2.toString());
    }
}
After that I want to work with localPath1 and localPath2 in my application, so that e.g. Files.copy(localPath1, newRemotePath) would copy a file on the WebDAV server to a new directory.
Is this the right course of action? Or are there other libraries to achieve that?
Apache VFS uses its own FileSystem interface, not the NIO one. You have three options, with varying levels of effort:
1. Change your code to use an existing WebDAV project that uses its own FileSystem, i.e. Apache VFS.
2. Find an existing project that supports WebDAV and implements the NIO FileSystem interface.
3. Implement the NIO FileSystem interface yourself.
Option 3 has already been done, so you may be able to customize what someone else has already written; have a look at nio-fs-provider or nio-fs-webdav. I'm sure there are others, but these two were easy to find using Google.
Implementing a WebDAV NIO FileSystem from scratch would be quite a lot of work, so I wouldn't recommend starting there; I'd take what someone has already done and make it work for me, i.e. Option 2.
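To give an idea of the end state: once such a provider is on the classpath and registers a webdav scheme, the rest of your NIO code can stay as it is. A rough sketch under that assumption (the URI format, credentials handling and mount semantics depend entirely on the provider you choose; without one, FileSystems.newFileSystem throws ProviderNotFoundException):

// Hypothetical mount of a WebDAV server as an NIO FileSystem
URI webdavUri = URI.create("webdav://user:password@example.org/");
Map<String, Object> env = new HashMap<>(); // provider-specific settings, if any
try (FileSystem webdavFs = FileSystems.newFileSystem(webdavUri, env)) {
    Path remote1 = webdavFs.getPath("/foo/bar1.txt");
    Path newRemotePath = webdavFs.getPath("/foo/copies/bar1.txt");
    // Existing NIO calls keep working against the remote paths
    Files.copy(remote1, newRemotePath, StandardCopyOption.REPLACE_EXISTING);
}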
I'm trying to use the LibreOffice mail merge functionality automatically from a Java application.
I have tried to install the LibreOffice SDK, but without success, because it requires software that is not available anymore (e.g. zip tools). Anyway, I was able to get the jar files (jurtl-3.2.1.jar, ridl-3.2.1.jar, unoil-3.2.1.jar and juh-3.2.1.jar) from the Maven repository.
With these jar files I was able to reproduce a lot of the examples provided here: http://api.libreoffice.org/examples/examples.html#Java_examples
Also, in the LibreOffice API documentation a service named 'MailMerge' is listed (see http://api.libreoffice.org/docs/idl/ref/servicecom_1_1sun_1_1star_1_1text_1_1MailMerge.html).
But in none of the jars is this service class available; the only class available to me is MailMergeType.
I'm able to open an *.odt template file within my Java code, and the next step would be to create an instance of the MailMerge service and pass a *.csv data source file to it.
In the API documentation some functions are listed which could help me, but as I said before, I'm not able to get access to this service class because it simply does not exist in the provided jar files.
Does anybody know how I can get access to the MailMerge service for LibreOffice?
If you need more information about my environment just ask.
Sincerely
Looking at this code from 2004, apparently you can simply use Java's Object class. Here are a few snippets from that code:
Object mmservice = null;
try {
    // Create an instance of the MailMerge service
    mmservice = mxMCF.createInstanceWithContext(
            "com.sun.star.text.MailMerge", mxComponentContext);
} catch (com.sun.star.uno.Exception e) {
    e.printStackTrace(); // handle UNO exceptions as appropriate
}

// Get the XPropertySet interface of the mmservice object
XPropertySet oObjProps = (XPropertySet)
        UnoRuntime.queryInterface(XPropertySet.class, mmservice);

try {
    // Set up the properties for the MailMerge command
    oObjProps.setPropertyValue("DataSourceName", mDataSourceName);
} catch (com.sun.star.uno.Exception e) {
    e.printStackTrace();
}

// Get the XJob interface from the MailMerge service and call execute on it
XJob job = (XJob) UnoRuntime.queryInterface(XJob.class, mmservice);
try {
    job.execute(new NamedValue[0]);
} catch (com.sun.star.uno.Exception e) {
    e.printStackTrace();
}
See also How to do a simple mail merge in OpenOffice.
Regarding a source for the old zip tools, try zip.exe from http://www.willus.com/archive/zip64/.
I am trying to update the jclouds libs we use from version 1.5 to 1.7.
We access the API in the following way:
https://github.com/jclouds/jclouds-examples/tree/master/rackspace/src/main/java/org/jclouds/examples/rackspace/cloudfiles
private RestContext<CommonSwiftClient, CommonSwiftAsyncClient> swift;
BlobStoreContext context = ContextBuilder.newBuilder(PROVIDER)
.credentials(username, apiKey)
.buildView(BlobStoreContext.class);
swift = context.unwrap();
RestContext has been deprecated since 1.6:
http://demobox.github.io/jclouds-maven-site-1.6.0/1.6.0/jclouds-multi/apidocs/org/jclouds/rest/RestContext.html
I tried to get it working this way:
ContextBuilder contextBuilder = ContextBuilder.newBuilder(rackspaceProvider)
.credentials(rackspaceUsername, rackspaceApiKey);
rackspaceApi = contextBuilder.buildApi(CloudFilesClient.class);
At runtime, when uploading a file, I get the following error:
org.jclouds.blobstore.ContainerNotFoundException
The examples in the jclouds GitHub project seem to use the deprecated approach (see the links mentioned above).
Any ideas how to solve this? Any alternatives?
Does the container that you're uploading into exist? The putObject method doesn't automatically create the container that you name if it doesn't exist; you need to call createContainer explicitly to create it, first.
Here's an example that creates a container and uploads a file into it:
CloudFilesClient client = ContextBuilder.newBuilder("cloudfiles-us")
.credentials(USERNAME, APIKEY)
.buildApi(CloudFilesClient.class);
client.createContainer("sample");
SwiftObject object = client.newSwiftObject();
object.getInfo().setName("somefile.txt");
object.setPayload("file or bytearray or something else here");
client.putObject("sample", object);
// ...
client.close();
You're right that the examples in jclouds-examples still reference the deprecated RestContext, but you should be able to translate them to the new style by substituting your rackspaceApi object where they call swift.getApi().
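For instance, where the old examples call swift.getApi().putObject(...), the 1.7-style code calls the same method directly on the API instance built via buildApi (the container and object names here are just placeholders):

// Old, RestContext-based style from the examples:
//     swift.getApi().putObject("sample", object);
// New style, calling the API instance directly:
rackspaceApi.createContainer("sample"); // the container must exist before putObject
rackspaceApi.putObject("sample", object);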