How to access file name within a DoFn in an unbounded pipeline

How to access file name within a DoFn in an unbounded pipeline - java

I'm looking for a way to access the name of the file being processed during the data transformation within a DoFn.
My pipeline is as shown below:
Pipeline p = Pipeline.create(options);
p.apply(FileIO.match()
.filepattern(options.getInput())
.continuously(Duration.standardSeconds(5),
Watch.Growth.<String>never()))
.apply(FileIO.readMatches()
.withCompression(Compression.GZIP))
.apply(XmlIO.<MyString>readFiles()
.withRootElement("root")
.withRecordElement("record")
.withRecordClass(MyString.class))//<-- This only returns the contents of the file
.apply(ParDo.of(new ProcessRecord()))//<-- I need to access file name here
.apply(ParDo.of(new FormatRecord()))
.apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
.apply(new CustomWrite(options));
Each file that is processed is an XML document. While processing the content, I need access to the name of the file being processed too to include in the transformed record.
Is there a way to achieve this?
This post has a similar question, but since i'm trying to use XmlIO I havent found a way to access the file metadata.
Below is the approach I found online, but not sure if there is a way to use it in the pipeline described above.
p.apply(FileIO.match()
.filepattern(options.getInput())
.continuously(Duration.standardSeconds(5),
Watch.Growth.<String>never()))//File Metadata
.apply(FileIO.readMatches()
.withCompression(Compression.GZIP))//Readable Files
.apply(MapElements
.into(TypeDescriptors.kvs(TypeDescriptors.strings(),new TypeDescriptor<ReadableFile>() {} ))
.via((ReadableFile file) -> {
return KV.of(file.getMetadata().resourceId().getFilename(),file);
})
);
Any suggestions are highly appreciated.
Thank you for your time reviewing this.
EDIT:
I took Alexey's advice and implemented a custom XmlIO. It would be nice if we could just extend the class we need and override the appropriate method. However, in this specific case, there was a reference to one method which was protected within the sdk because of which I couldn't easily override what i needed and instead ended up copying a whole bunch of files. While this works for now, I hope in future there is a more straighforward way to access the file metadata in these IO implementations.

I don't think it's possible to do "out-of-box" with a current implementation of of XmlIO since it returns a PCollection<T> where T is a type of your xml record and, if I'm not mistaken, there is no way to add a file name there. Though, you still can try to "reimplement" a ReadFiles and XmlSource in a way that it will return parsed payload and input file metadata.

Related

Can I transform a JSON-LD to a Java object?

EDIT: I changed my mind. I would find a way to generate the Java class and load the JSON as an object of that class.
I just discovered that exists a variant of JSON called JSON-LD.
It seems to me a more structured way of defining JSON, that reminds me XML with an associated schema, like XSD.
Can I create a Java class from JSON-LD, load it at runtime and use it to convert JSON-LD to an instantiation of that class?
I read the documentation of both the implementations but I found nothing about it. Maybe I read them bad?

Doing a Google search brought me to a library that will decode the JSON-LD into an "undefined" Object.
// Open a valid json(-ld) input file
InputStream inputStream = new FileInputStream("input.json");
// Read the file into an Object (The type of this object will be a List, Map, String, Boolean,
// Number or null depending on the root object in the file).
Object jsonObject = JsonUtils.fromInputStream(inputStream);
// Create a context JSON map containing prefixes and definitions
Map context = new HashMap();
// Customise context...
// Create an instance of JsonLdOptions with the standard JSON-LD options
JsonLdOptions options = new JsonLdOptions();
// Customise options...
// Call whichever JSONLD function you want! (e.g. compact)
Object compact = JsonLdProcessor.compact(jsonObject, context, options);
// Print out the result (or don't, it's your call!)
System.out.println(JsonUtils.toPrettyString(compact));
https://github.com/jsonld-java/jsonld-java
Apparently, it can take it from just a string as well, as if reading it from a file or some other source. How you access the contents of the object, I can't tell. The documentation seems to be moderately decent, though.
It seems to be an active project, as the last commit was only 4 days ago and has 30 contributors. The license is BSD 3-Clause, if that makes any difference to you.
I'm not in any way associate with this project. I'm not an author nor have I made any pull requests. It's just something I found.
Good luck and I hope this helped!

see this page: JSON-LD Module for Jackson

Can we parameterize the request file name to the Read method in Karate?

As I'm trying to automate the API testing process, have to pass the XML file to Read method for example,
Given request read ( varXmlFile )
FYI: XML file is present in the same folder where the feature file exists.
Doing this, its throwing an exception like this
com.intuit.karate.exception.KarateException: called: D:\workspace\APIAutomationDemo\target\test-classes\com\org\features\rci_api_testing.feature, scenario: Get Membership Details, line: 15
javascript evaluation failed: read (varXmlFile )
So Karate doesn't allow this way or can we have any other alternative ?
Suggestion please.
Thanks

Please ensure the variable is set:
* def varXmlFile = 'some-xml-file.xml'
Given request read(varXmlFile)
Or just use normally:
Given request read('some-xml-file.xml')

The problem got solved as in the variable varXmlFile holds the file name along with single quote like this 'SampleXmlRequest.xml'.
So I removed the single quote while returning from the method.

How to iterate through all files in IntelliJ plugin project?

I'm trying to create an IntelliJ plugin that iterates over all files in the project folder and parses all the .java files and then makes some changes in them. The problem is that after reading the documentation I don't have a clear idea how to iterate files over the whole project folder, I think I may use PSI files but I am not sure. Does anyone know or has an idea on how to accomplish this?

To iterate all files in project content, you can use ProjectFileIndex.SERVICE.getInstance(project).iterateContent.
Then you can get PSI files from them (PsiManager#findFile), check if they're Java (instanceof PsiJavaFile) and do whatever you like.
If you don't need PSI, you can just check the file type
(VirtualFile#getFileType == JavaFileType.INSTANCE) and perform the modifications via document (FileDocumentManager#getDocument(file)) or VFS (LoadTextUtil#loadText, VfsUtil#saveText).

A possible way is to use AllClassesGetter, like this:
Processor<PsiClass> processor = new Processor<PsiClass>() {
#Override
public boolean process(PsiClass psiClass) {
// do your actual work here
return true;
}
};
AllClassesGetter.processJavaClasses(
new PlainPrefixMatcher(""),
project,
GlobalSearchScope.projectScope(project),
processor
);
processJavaClasses() will look for classes matching a given prefix in a given scope. By using an empty prefix and GlobalSearchScope.projectScope(), you should be able to iterate all classes declared in your project, and process them in processor. Note that the processor handles instances of PsiClass, which means you won't have to parse files manually. To modify classes, you just have to change the tree represented by these PsiClasses.

Merging two .odt files from code

How do you merge two .odt files? Doing that by hand, opening each file and copying the content would work, but is unfeasable.
I have tried odttoolkit Simple API (simple-odf-0.8.1-incubating) to achieve that task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for(File f : filesToMerge){
joinOdt(f);
}
...
void joinOdt(File joinee){
TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
master.save(masterFile);
}
And that works reasonably well, however it looses information about fonts - original files are a combination of Arial Narrow and Windings (for check boxes), output masterFile is all in TimesNewRoman. At first I suspected last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?

I think this is "works as designed".
I tried this once with a global document, which imports documents and display them as is... as long as paragraph styles have different names !
Using same named templates are overwritten with the values the "master" document have.
So I ended up cloning standard styles with unique (per document) names.
HTH

Ma case was a rather simple one, files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting off of one of my files, instead of an empty document fixed my problem.
However this question will remain open until someone comes up with a more general solution to formatting retention (possibly based on ngulams answer and comments?).

Commons configuration library to add elements

I am using the apache commons configuration library to read a configuration xml and it works nicely. However, I am not able to modify the value of the elements or add new ones.
To read the xml I use the following code:
XMLConfiguration config = new XMLConfiguration(dnsXmlPath);
boolean enabled = config.getBoolean("enabled", true));
int size = config.getInt("size");
To write I am trying to use:
config.setProperty("newProperty", "valueNewProperty");
config.save();
If I call config.getString("newProperty"), I obtain "valueNewProperty", but the xml has not been changed.
Obviously it is not the right way or I am missing something, because it does not work.
Could anybody tell me how to do this?
Thanks in advance.

You're modifying xml structure in memory
The parsed document will be stored keeping its structure. The class also tries to preserve as much information from the loaded XML document as possible, including comments and processing instructions. These will be contained in documents created by the save() methods, too.
Like other file based configuration classes this class maintains the name and path to the loaded configuration file. These properties can be altered using several setter methods, but they are not modified by save() and load() methods. If XML documents contain relative paths to other documents (e.g. to a DTD), these references are resolved based on the path set for this configuration.
You need to use XMLConfiguration.html#save(java.io.Writer) method
For example, after you've done all your modifications save it:
config.save(new PrintWriter(new File(dnsXmlPath)));
EDIT
As mentioned in comment, calling config.load() before calling setProperty() method fixes the issue.

I solved it with the following lines. I was missing the config.load().
XMLConfiguration config = new XMLConfiguration(dnsXmlPath);
config.load();
config.setProperty("newProperty", "valueNewProperty");
config.save();
It is true though that you can used the next line instead of config.save() and works the same.
config.save(new PrintWriter(new File(dnsXmlPath)));

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to access file name within a DoFn in an unbounded pipeline - java

Related

Can I transform a JSON-LD to a Java object?

Can we parameterize the request file name to the Read method in Karate?

How to iterate through all files in IntelliJ plugin project?

Merging two .odt files from code

Commons configuration library to add elements

Categories

Resources