Remove Attributes by Name. Filter broken? - java

WEKA's RemoveByName attribute filter should remove every attribute whose name matches a specified regular expression from a set of Instances.
I have problems with the regex.
I tried several simple expressions, all of which are valid (tested on regexr).
But the filter does not seem to accept them.
The relevant code follows.
Instances dataset1_x=new Instances(dataset1);
RemoveByName filterX=new RemoveByName();
filterX.setInputFormat(dataset1_x);
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
//filterX.setExpression("^.*i$"); also doesn't work
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);
This should match all attribute names ending with an "i".
The resulting dataset is named
"dataset-weka.filters.unsupervised.attribute.StringToNominal-Rlast-weka.filters.unsupervised.attribute.Remove-weka.filters.unsupervised.attribute.RemoveByName-E^.*id$"
Note that ^.*id$ is the default expression; it has not changed, although filterX.getExpression() returns the regex I set.
My usage of the filter matches several code examples I found.
The same happens if I set the regex using Filter.setOptions().
The issue occurs in both version 3.9.0 dev and 3.8 stable.
In the WEKA GUI, the filter works correctly.
So my other assumption is that a regex set programmatically must have a special format. Unfortunately, the API documentation does not provide examples.

You need to set the expression and the invertSelection flag before setting the input format.
More generally, I assume that you have to set all options before calling setInputFormat.
The following works.
Instances dataset1_x=new Instances(dataset1);
RemoveByName filterX=new RemoveByName();
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
filterX.setInputFormat(dataset1_x);
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);
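A side note on Pattern.quote in the snippets above: it wraps its argument so that every character is matched literally, which means the quoted expression no longer behaves like ^.*i$. The sketch below uses only java.util.regex, not WEKA; the class QuoteDemo and its method names are made up for illustration, and how RemoveByName applies the expression internally is not verified here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it exercises only java.util.regex, not WEKA.
class QuoteDemo {
    // True if the whole name matches the raw expression ^.*i$.
    static boolean matchesRaw(String name) {
        return Pattern.compile("^.*i$").matcher(name).matches();
    }

    // True if the whole name matches the quoted expression,
    // i.e. the literal text ^.*i$ with no metacharacters.
    static boolean matchesQuoted(String name) {
        return Pattern.compile(Pattern.quote("^.*i$")).matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("^.*i$"));   // prints \Q^.*i$\E
        System.out.println(matchesRaw("petal_i"));    // prints true
        System.out.println(matchesQuoted("petal_i")); // prints false
        System.out.println(matchesQuoted("^.*i$"));   // prints true
    }
}
```

If the goal is to match names ending in "i", passing the raw string "^.*i$" to setExpression (after fixing the call order) is probably the form to try.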

Related

Regex to match conan dependency from conanfile.txt

I am trying to create a regex in Java that matches each dependency and extracts its name, version, owner and channel, but I haven't been able to write one that covers all the possible scenarios.
The structure is something like name/version#owner/channel, where the version might have a semver structure and the owner and channel are optional.
Currently, I have:
^(?<name>[\d\w][\d\w\+\.-]+)\/(?<version>[\d\w][\d\w\.-]+)(#(?<owner>\w+))?(\/(?<channel>.+))?$
but it's failing for boost_atomic/1.59.0+4#owner/release, since the +4 is not matched and I need the value before that -> 1.59.0
Some other scenarios that need to be valid and are valid for the regex above are:
Poco/1.9.0#pocoproject/stable
zlib/1.2.11#conan/stable
freetype/2.10.1/stable
openssl/1.0.2g/stable
openssl/1.0.2g
openssl/1.0.2g#owner
Also, there might be some dependencies with comments:
zlib/1.2.11#conan/stable # comment
In that case I need to discard the comment and extract only the relevant information with the regex.
I am not sure if my current regex is good, but from what I've tested only some scenarios are missing.
You can simplify your regex: instead of listing and escaping many characters in a character set, use something like [^\/] to capture anything except /, since you want everything preceding a slash.
I've made some modifications; the updated regex that should work for you is the following:
^(?<name>[^\/]+)\/(?<version>[^\/#\s]+)(#(?<owner>\w+))?(\/(?<channel>\S+))?(?:\s*#\s*(?<comment>.+))?$
I've added another named group for comment as you mentioned that can also be present. Let me know if this works for you.
Edit: If the channel contains text like release:132434 and everything from the colon onward is to be ignored as part of the channel, you can use the updated regex below:
^(?<name>[^\/]+)\/(?<version>[^\/#\s]+)(?:#(?<owner>\w+))?(?:\/(?<channel>[^:\s]+)\S*)?(?:\s*#\s*(?<comment>.+))?\s*$
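To check the answer's first regex against the sample lines from the question, a small Java harness along these lines can help. The class ConanRef, the parse helper and the NO_BUILD variant are illustrative assumptions, not part of the answer. Note that the answer's regex, as written, keeps the +4 suffix in the version group; NO_BUILD is one possible way to drop it, by excluding + from the version and skipping an optional +suffix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for trying the regexes on conanfile.txt lines.
class ConanRef {
    // First regex from the answer, written as a Java string literal.
    static final Pattern ANSWER = Pattern.compile(
        "^(?<name>[^\\/]+)\\/(?<version>[^\\/#\\s]+)(#(?<owner>\\w+))?"
        + "(\\/(?<channel>\\S+))?(?:\\s*#\\s*(?<comment>.+))?$");

    // Assumption, not from the answer: stop the version at '+' and
    // skip an optional "+suffix" so that 1.59.0+4 yields 1.59.0.
    static final Pattern NO_BUILD = Pattern.compile(
        "^(?<name>[^\\/]+)\\/(?<version>[^\\/#\\s+]+)(?:\\+[^\\/#\\s]*)?"
        + "(#(?<owner>\\w+))?(\\/(?<channel>\\S+))?(?:\\s*#\\s*(?<comment>.+))?$");

    // Returns the named groups, or null if the line does not match.
    // Optional parts that are absent map to null.
    static Map<String, String> parse(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (String g : new String[] {"name", "version", "owner", "channel", "comment"}) {
            out.put(g, m.group(g));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse(ANSWER, "zlib/1.2.11#conan/stable # comment"));
        System.out.println(parse(ANSWER, "boost_atomic/1.59.0+4#owner/release"));
        System.out.println(parse(NO_BUILD, "boost_atomic/1.59.0+4#owner/release"));
    }
}
```

With ANSWER, the version for boost_atomic comes out as 1.59.0+4; with NO_BUILD it comes out as 1.59.0, which is what the question asks for.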

How to get rid of files with names bin$

Using the jOOQ generator via the Gradle plugin, I now get, alongside the POJO and table classes with normal names, heaps of files whose names start with bin$.
They are not necessary: only yesterday the generator did not produce these files, and everything works with or without them. But I don't want the project littered with dozens of superfluous files.
Since version 10, Oracle moves dropped tables to the recycle bin, where they get names starting with BIN$. So jOOQ simply generates classes for dropped tables. This can be prevented in two ways: either stop using the recycle bin in Oracle, or filter the tables for which the jOOQ generator creates classes.
ALTER SYSTEM SET RECYCLEBIN = OFF DEFERRED;
purge dba_recyclebin;
or to change the generator setting (the example is for Gradle)
generator{
...
database {
...
excludes = '(?i:BIN\\$.*)'
Edit: After several attempts (by Lukas) and checks (by me), Lukas found the correct form for excludes. The doubled backslash is needed because Groovy processes backslash escape sequences even in single-quoted strings; single quotes only switch off ${...} interpolation.
jOOQ's <excludes/> setting is a Java regular expression. You have to properly form it like this:
excludes = '(?i:BIN\\$.*)'
Explanation:
Use (?i:...) for case-insensitivity. Just in case. Pun intended.
Use \\ before the $ sign because the $ means "end of line" in regular expressions. You want to escape that. And because Groovy/Gradle parses (as in "look for escape sequences") your string, you need to escape the backslash too, for it to reach the Java Pattern.compile() call
Use .* to indicate that after the $, you want to match any number of characters. . = any character and * = any number of repetitions
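The effect of the escaped $ can be checked with plain java.util.regex. The sketch below assumes whole-name matching and uses made-up table names; the class RecycleBinExclude is purely illustrative and says nothing about how jOOQ applies the expression internally.

```java
import java.util.regex.Pattern;

// Hypothetical check of the excludes expression against sample table names.
class RecycleBinExclude {
    // Same regex text as the Groovy literal '(?i:BIN\\$.*)'.
    static final Pattern ESCAPED = Pattern.compile("(?i:BIN\\$.*)");

    // Without the backslash, $ is an end-of-input anchor, not a literal dollar.
    static final Pattern UNESCAPED = Pattern.compile("(?i:BIN$.*)");

    static boolean excluded(Pattern p, String tableName) {
        return p.matcher(tableName).matches();
    }

    public static void main(String[] args) {
        System.out.println(excluded(ESCAPED, "BIN$abc==$0"));   // prints true
        System.out.println(excluded(ESCAPED, "bin$abc"));       // prints true
        System.out.println(excluded(ESCAPED, "BINARY_DATA"));   // prints false
        System.out.println(excluded(UNESCAPED, "BIN$abc==$0")); // prints false
    }
}
```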

ReplaceText Usage in Apache NiFi

I am trying to make use of the out-of-the-box processor ReplaceText in Apache NiFi to search inside a .dsv file, match all datetime formats and convert them into dates. I am not sure, however, how to configure the processor itself. I have tried to set my search value (Search Value Property) to something like this:
(0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d)
My Replacement Value is ${time:format("yyyy-MM-dd'")}. I have also set up another property named time, which in turn is set to (0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d)
This does not work, and I have the feeling I am not using ReplaceText as intended. Can you help?
EDIT:
I should have included that I am using the Replacement Strategy called Regex Replace and Evaluation Mode Entire text.
I believe a similar question was answered on the Apache mailing list, for reference:
I created a template [1] that shows an example of how to do the date conversion you described. It is linked to from the main templates page on the wiki [2] and is named "DateConversion.xml"
It first uses ExtractText to find the date string and extract it into an attribute called "date". The regular expression used is: (\d{2}-\d{2}-\d{4} \d{2}.\d{2}.\d{2})
Then it uses ReplaceText with the Search Value of the same regular expression above, to replace that with ${date:toDate("dd-MM-yyyy HH.mm.ss"):format("yyyy-MM-dd HH:mm:ss+0000")}
[1] https://cwiki.apache.org/confluence/download/attachments/57904847/DateConversion.xml?version=2&modificationDate=1462288576652&api=v2
[2] https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
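Outside NiFi, the same find-and-reformat step can be sketched in plain Java to see what the template is expected to produce. This is only an approximation: NiFi's expression language is not involved, the class DateRewrite and the sample input are made up, and the code assumes Java 9 or later for Matcher.replaceAll with a function.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Pattern;

// Hypothetical stand-in for the ExtractText + ReplaceText pair.
class DateRewrite {
    // Regular expression quoted in the mailing-list answer.
    static final Pattern DATE =
        Pattern.compile("(\\d{2}-\\d{2}-\\d{4} \\d{2}.\\d{2}.\\d{2})");

    // Input and output formats from the toDate(...) and format(...) calls.
    static final DateTimeFormatter IN =
        DateTimeFormatter.ofPattern("dd-MM-yyyy HH.mm.ss");
    static final DateTimeFormatter OUT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss'+0000'");

    // Replaces every matched date string with its reformatted version.
    // Throws DateTimeParseException if a match is not a valid date,
    // for example because the unescaped '.' matched another separator.
    static String rewrite(String text) {
        return DATE.matcher(text).replaceAll(
            mr -> LocalDateTime.parse(mr.group(1), IN).format(OUT));
    }

    public static void main(String[] args) {
        // prints id;2016-03-15 14:05:09+0000;ok
        System.out.println(rewrite("id;15-03-2016 14.05.09;ok"));
    }
}
```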

How to apply multiple filters in google analytics

How do I filter on multiple dimensions in Google Analytics?
Neither of the following works:
.setFilters("ga:userType==anonymous").setFilters( "ga:dimension3==1234")
.setFilters("ga:userType==anonymous","ga:dimension3==1234")
The second one gives an error.
You need to string them together.
Combining Filters
Filters can be combined using OR and AND boolean logic. This allows you to effectively extend the 128 character limit of a filter expression.
The OR operator is defined using a comma (,).
ga:country==United%20States,ga:country==Canada
The AND operator is defined using a semi-colon (;).
ga:country==United%20States;ga:browser==Firefox
I am not sure what language that is, but it's probably going to be more like
setFilters("ga:userType==anonymous,ga:dimension3==1234")
.setFilters takes only one string.
.setFilters(String, String) gives an error.
Chaining two calls
.setFilters(String1)
.setFilters(String2)
does not filter the data as desired.
As a workaround, I created a segment, put all the filters there and used that segment in my data pull. That works for now, but I am still looking for the filter code.
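Based on the quoted documentation, a single filter string can be built by joining the conditions with a semicolon (AND) or a comma (OR). Since the question asks for both conditions to hold, the semicolon form is probably the one wanted. The helper class GaFilters below is hypothetical, and the sketch only builds the string; it does not call the Analytics API.

```java
// Hypothetical helper that builds one filter expression from several conditions.
class GaFilters {
    // AND: every condition must hold (semicolon separator).
    static String and(String... conditions) {
        return String.join(";", conditions);
    }

    // OR: at least one condition must hold (comma separator).
    static String or(String... conditions) {
        return String.join(",", conditions);
    }

    public static void main(String[] args) {
        // prints ga:userType==anonymous;ga:dimension3==1234
        System.out.println(and("ga:userType==anonymous", "ga:dimension3==1234"));
        // prints ga:country==United%20States,ga:country==Canada
        System.out.println(or("ga:country==United%20States", "ga:country==Canada"));
    }
}
```

The result would then be passed as the single argument, for example .setFilters(GaFilters.and("ga:userType==anonymous", "ga:dimension3==1234")).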

Struts2 JSON plugin includeProperties - not full regex support?

The documentation for the Struts2 JSON plugin, version 2.3.x, says that the includeProperties parameter can be set and behaves as follows:
A comma-delimited list of regular expressions can be passed to the
JSON Result to restrict which properties will be serialized. ONLY
properties matching any of these regular expressions will be included
in the serialized output.
However, from my own testing, this doesn't appear to be the case. At any rate, it does not seem to support full regex syntax as one might expect (i.e. the full set of expressions that would work with java.util.regex.Pattern).
Take a simple example where we might want to use the greedy optional quantifier ("?") with a group. To make things concrete, this pattern: ^(items\\[\\d+\\]\\.)?userName$ does not work; it is ignored and your includeProperties ends up being null.
But if you instead just use ^items\\[\\d+\\]\\.userName$ then it works (the pattern is recognized and added). Looking through org.apache.struts2.json.JSONUtil source shows that there is a lot of custom code written to process the patterns.
It's not mentioned in the JSON plugin documentation that only a special subset of regex is supported. What is the story on which types of expressions are supported or not supported by this plugin?
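For reference, a quick check in plain java.util.regex suggests that the optional-group pattern is itself valid, so the rejection would have to come from the plugin's own pattern handling rather than from the regex engine. The class IncludePatternCheck is made up for this check and does not touch Struts2.

```java
import java.util.regex.Pattern;

// Hypothetical check of the two patterns from the question, outside Struts2.
class IncludePatternCheck {
    // Pattern with the optional group, reported as ignored by the plugin.
    static final Pattern OPTIONAL_PREFIX =
        Pattern.compile("^(items\\[\\d+\\]\\.)?userName$");

    // Pattern without the optional group, reported as working.
    static final Pattern FIXED_PREFIX =
        Pattern.compile("^items\\[\\d+\\]\\.userName$");

    static boolean matches(Pattern p, String property) {
        return p.matcher(property).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(OPTIONAL_PREFIX, "userName"));          // prints true
        System.out.println(matches(OPTIONAL_PREFIX, "items[3].userName")); // prints true
        System.out.println(matches(FIXED_PREFIX, "userName"));             // prints false
        System.out.println(matches(FIXED_PREFIX, "items[3].userName"));    // prints true
    }
}
```

An untested idea, given that includeProperties takes a comma-delimited list: supplying the two alternatives as separate patterns (one for userName alone, one for the items[...] form) might avoid the optional group altogether.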
