ReplaceText Usage in Apache NiFi - java

I am trying to make use of the out-of-the-box processor ReplaceText in Apache NiFi to search inside a .dsv file, match all datetime formats and convert them into dates. I am not sure, however, how to configure the processor itself. I have tried to set my search value (Search Value Property) to something like this:
(0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d)
My Replacement Value is regex1, which is set to ${time:format("yyyy-MM-dd'")}. I have also set up another property named time, which in turn is set to (0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d)
This does not work, and I have the feeling I am not using ReplaceText the way it is meant to be used. Can you help?
EDIT:
I should have included that I am using the Replacement Strategy called Regex Replace and Evaluation Mode Entire text.

I believe a similar question was answered on the Apache mailing list, for reference:
I created a template [1] that shows an example of how to do the date conversion you described. It is linked to from the main templates page on the wiki [2] and is named "DateConversion.xml"
It first uses ExtractText to find the date string and extract it into an attribute called "date". The regular expression used is: (\d{2}-\d{2}-\d{4} \d{2}.\d{2}.\d{2})
Then it uses ReplaceText with the Search Value of the same regular expression above, to replace that with ${date:toDate("dd-MM-yyyy HH.mm.ss"):format("yyyy-MM-dd HH:mm:ss+0000")}
[1] https://cwiki.apache.org/confluence/download/attachments/57904847/DateConversion.xml?version=2&modificationDate=1462288576652&api=v2
[2] https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
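For anyone who wants to see what that toDate/format chain amounts to outside NiFi, here is a rough plain-Java sketch of the same idea: find each date with the regular expression, parse it with the input pattern, and write it back in the output pattern. It is only an illustration, not how NiFi implements it. The class and method names (DateRewrite, rewrite) are made up, java.time is my choice, and I escaped the dots in the time part so they only match literal dots. The "+0000" suffix is appended as literal text, as in the format string above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NiFi.
class DateRewrite {
    // Same shape as the ExtractText/ReplaceText expression, with the dots escaped.
    private static final Pattern DATE =
            Pattern.compile("\\d{2}-\\d{2}-\\d{4} \\d{2}\\.\\d{2}\\.\\d{2}");
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd-MM-yyyy HH.mm.ss");
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Replaces every matching date in the text with its reformatted version.
    static String rewrite(String text) {
        Matcher m = DATE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String converted = LocalDateTime.parse(m.group(), IN).format(OUT) + "+0000";
            // quoteReplacement keeps any '$' or '\' in the result literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(converted));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Unlike the template, which extracts a single "date" attribute, this sketch converts each match independently.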

Related

Different result between Javascript and Java regular expression matches

Now I am trying to match some patterns from a String containing elasticsearch's structured bulk requests. Here is an example:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]}, update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}, delete {[event_20191208][_doc][sjdos]}, update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
My goal is to match every separate request out of the bulk requests string, i.e to get strings like:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]},
update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]},
delete {[event_20191208][_doc][sjdos]},
update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
And my pattern expression is [a-z]+\s\{.+?\}[,\w\t\r\n]+? which works fine in a JavaScript-based online regular expression tester.
However, when I copied this pattern expression into my Java code, the output was not what I expected.
So I realized there are some differences between the JavaScript and Java regular expression engines, but after a lot of coding and googling I still cannot figure out how to update my expression so that it works in Java.
I would be very grateful if someone could give me a hint.
After a short nap, I had an epiphany. I was a fool in the morning...
The workaround is so easy to implement. Elasticsearch has well overridden toString() for us.
At first glance, I wouldn't suggest using regex right away. It looks like those lines follow some kind of pattern that you could parse and split up first.
After that, if you're talking about regex, I'd try:
Taking a look at the java regex format: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
How about using an online java regex tool instead?
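The "parse and split up first" suggestion could look roughly like the sketch below, which avoids regex entirely: it walks the string once, tracks how deeply nested it is inside { } and [ ], and cuts at every comma found at depth zero. This is only a sketch under assumptions: BulkSplitter and split are made-up names, and it assumes the JSON values themselves contain no unbalanced braces or brackets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits the bulk string on top-level commas only.
class BulkSplitter {
    static List<String> split(String bulk) {
        List<String> parts = new ArrayList<>();
        int depth = 0;   // current nesting level of { } and [ ]
        int start = 0;   // start index of the request being collected
        for (int i = 0; i < bulk.length(); i++) {
            char c = bulk.charAt(i);
            if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                // A comma outside all brackets separates two requests.
                parts.add(bulk.substring(start, i).trim());
                start = i + 1;
            }
        }
        String tail = bulk.substring(start).trim();
        if (!tail.isEmpty()) {
            parts.add(tail);
        }
        return parts;
    }
}
```

Each returned element is one request without the trailing comma, so nested "index {...}" blocks inside an update stay attached to their parent.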

Remove Attributes by Name. Filter broken?

There is an attribute filter that should remove every attribute matching a specified regular expression from a set of Instances.
I have problems with the regex.
I tried several simple expressions, all of which are valid (tested on regexr).
But the filter does not seem to accept them.
The relevant code follows.
Instances dataset1_x=new Instances(dataset1);
RemoveByName filterX=new RemoveByName();
filterX.setInputFormat(dataset1_x);
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
//filterX.setExpression("^.*i$"); also doesn't work
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);
This should match all names ending with an "i".
The resulting dataset is named
"dataset-weka.filters.unsupervised.attribute.StringToNominal-Rlast-weka.filters.unsupervised.attribute.Remove-weka.filters.unsupervised.attribute.RemoveByName-E^.*id$"
Note that ^.*id$ is the default expression. It has not changed,
although filterX.getExpression(); returns the regex I set before.
Also this usage of the filter corresponds to several code-examples.
Same if I set the regex using Filter.setOptions();
This is an issue of version 3.9.0 dev and also 3.8 stable.
Using the WEKA-GUI, the filter is working correctly.
So another assumption is that a regex entered programmatically must have a special format. Unfortunately, the API documentation does not provide examples.
You need to set the expression and the InvertSelection flag before setting the input format.
More generally, I assume that you have to set all options before setting the input format.
The following works.
Instances dataset1_x=new Instances(dataset1);
RemoveByName filterX=new RemoveByName();
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
filterX.setInputFormat(dataset1_x);
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);

What is the regular expression for starts with using regexp filter on elastic search?

I'm working on a search engine using Elasticsearch, through its Java API, and would like to configure a regexp filter for my queries, in particular a "starts with" filter.
Suppose I have these titles in my Index:
the world
things about him
george's ultimatum
jumping
jimmy and the flock
If I would like to get the results exactly starting with the letter t or th, what regular expression should I use?
CORRECT RESULTS AFTER SEARCH SHOULD BE
the world
things about him
I've tried using:
^t.* OR ^[t.*]
But it doesn't return any results. The starting anchor ^ doesn't work in Elasticsearch even though the documentation says it should.
t.* OR [t.*]
But it works just like the prefix filter, and includes the result "jimmy and the flock"
Note:
I cannot use the regexp query (a limitation of the search engine I'm building), so I'm forced to use only a filter
I've tried using the prefix filter, but it evaluates terms: the prefix parameter "t", for example, will include the title "jimmy and the flock" because of the term "the"
BTW, I'm using ES version 1.0.0
There is a page on the Elasticsearch blog that answers exactly this problem: http://www.elasticsearch.org/blog/starts-with-phrase-matching/
As pickypg suggests, it is a mapping problem: you must set a special analyzer that combines the "keyword" tokenizer and the "lowercase" filter.
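To make the mapping point concrete, here is a small plain-Java simulation (no Elasticsearch involved) of why the analyzer matters. As far as I know, Lucene regular expressions always have to match the whole term, which would also explain why ^ does not help. With a field split into terms, "t.*" hits any title containing a term that starts with t; with a keyword tokenizer plus lowercase filter, the whole title is one term. The names StartsWithDemo, matchPerTerm and matchWholeTitle are made up, and splitting on whitespace is only a crude stand-in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical simulation of term-level vs. whole-title regex matching.
class StartsWithDemo {
    // Analyzed field (simplified): a title is a hit if ANY single term matches fully.
    static List<String> matchPerTerm(List<String> titles, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            for (String term : title.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (p.matcher(term).matches()) {
                    hits.add(title);
                    break;
                }
            }
        }
        return hits;
    }

    // Keyword tokenizer + lowercase filter (simplified): the whole title is one term.
    static List<String> matchWholeTitle(List<String> titles, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (p.matcher(title.toLowerCase(Locale.ROOT)).matches()) {
                hits.add(title);
            }
        }
        return hits;
    }
}
```

With the five titles from the question and the regex t.*, matchPerTerm also returns "jimmy and the flock" (because of "the"), while matchWholeTitle returns only "the world" and "things about him".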

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated HTML links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern as a raw pattern, so don't forget to escape the double quotes and double the backslashes before using it in a Java string.
The main interest of this pattern is that urls are compared without the substring http:// to obtain more results.
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
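A minimal sketch of the back-reference idea in Java, assuming the links have the simplest possible shape (a single double-quoted href attribute and nothing but the URL as link text). Unlinkify and unlinkify are made-up names, and the pattern is deliberately much stricter than real HTML allows, for the reasons given above.

```java
import java.util.regex.Pattern;

// Hypothetical helper: undoes links whose href equals their link text.
class Unlinkify {
    // Group 1 captures the href value; \1 requires the link text to be identical.
    private static final Pattern SAME_LINK =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"\\s*>\\1</a>");

    static String unlinkify(String html) {
        // $1 in the replacement is the captured URL.
        return SAME_LINK.matcher(html).replaceAll("$1");
    }
}
```

Links whose text differs from the href do not match the pattern at all, so they are left untouched, which is the behaviour asked for in the question.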

Struts2 JSON plugin includeProperties - not full regex support?

The documentation for Struts2 plugin version 2.3.x says that the includeProperties parameter can be set, and behaves as follows:
A comma-delimited list of regular expressions can be passed to the
JSON Result to restrict which properties will be serialized. ONLY
properties matching any of these regular expressions will be included
in the serialized output.
However, from my own testing, this doesn't appear to be the case. At any rate, it does not seem to support full regex syntax as one might expect (i.e. the full set of expressions that would work with java.util.regex.Pattern).
Take a simple example where we might want to use the greedy optional quantifier ("?") with a group. To make things concrete, this pattern: ^(items\\[\\d+\\]\\.)?userName$ does not work; it is ignored and your includeProperties ends up being null.
But if you instead just use ^items\\[\\d+\\]\\.userName$ then it works (the pattern is recognized and added). Looking through org.apache.struts2.json.JSONUtil source shows that there is a lot of custom code written to process the patterns.
It's not mentioned in the JSON plugin documentation that only a special subset of regex is supported. What is the story on which types of expressions are supported or not supported by this plugin?
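For reference, this quick check shows that the optional-group pattern is perfectly valid for java.util.regex.Pattern itself, so the problem would have to come from the plugin's own pre-processing rather than from the regex syntax. It says nothing about what the plugin actually does internally; IncludePatternCheck and accepts are made-up names.

```java
import java.util.regex.Pattern;

// Hypothetical check: exercises the pattern with plain java.util.regex only.
class IncludePatternCheck {
    // Same pattern as in the question, written as a Java string literal.
    static final Pattern OPTIONAL_PREFIX =
            Pattern.compile("^(items\\[\\d+\\]\\.)?userName$");

    static boolean accepts(String property) {
        return OPTIONAL_PREFIX.matcher(property).matches();
    }
}
```

Here accepts("userName") and accepts("items[3].userName") are both true, while accepts("items.userName") is false, which is what one would expect from the optional group.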
