Struts2 JSON plugin includeProperties - not full regex support? - java

The documentation for Struts2 plugin version 2.3.x says that the includeProperties parameter can be set, and behaves as follows:
A comma-delimited list of regular expressions can be passed to the
JSON Result to restrict which properties will be serialized. ONLY
properties matching any of these regular expressions will be included
in the serialized output.
However, from my own testing, this doesn't appear to be the case. At any rate, it does not seem to support full regex syntax as one might expect (i.e. the full set of expressions that would work with java.util.regex.Pattern).
Take a simple example where we might want to use the greedy optional quantifier ("?") with a group. To make things concrete, this pattern: ^(items\\[\\d+\\]\\.)?userName$ does not work; it is ignored and your includeProperties ends up being null.
But if you instead just use ^items\\[\\d+\\]\\.userName$ then it works (the pattern is recognized and added). Looking through org.apache.struts2.json.JSONUtil source shows that there is a lot of custom code written to process the patterns.
It's not mentioned in the JSON plugin documentation that only a special subset of regex is supported. What is the story on which types of expressions are supported or not supported by this plugin?
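For reference, a minimal sketch that checks the pattern against plain java.util.regex, outside of Struts. It only shows that the expression is valid there and matches both forms; it says nothing about how the JSON plugin parses includeProperties. The class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Struts: compiles the pattern from the
// question with java.util.regex and tests property names against it.
class IncludePatternCheck {
    static final Pattern USER_NAME =
            Pattern.compile("^(items\\[\\d+\\]\\.)?userName$");

    static boolean matches(String property) {
        return USER_NAME.matcher(property).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("userName"));
        System.out.println(matches("items[3].userName"));
        System.out.println(matches("items[3].password"));
    }
}
```

So whatever rejects the pattern would have to be the plugin's own pre-processing, not java.util.regex.Pattern itself.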

Related

Different result between Javascript and Java regular expression matches

Now I am trying to match some patterns from a String containing elasticsearch's structured bulk requests. Here is an example:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]}, update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}, delete {[event_20191208][_doc][sjdos]}, update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
My goal is to match every separate request in the bulk request string, i.e. to get strings like:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]},
update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]},
delete {[event_20191208][_doc][sjdos]},
update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
My pattern is [a-z]+\s\{.+?\}[,\w\t\r\n]+? which works fine in a JavaScript-based online regular expression tester.
However, when I copied this pattern into my Java code, the output was not what I expected.
So I realized there are some differences between the JavaScript and Java regular expression engines, but after a lot of coding and googling I still cannot figure out how to change my expression so that it works in Java.
I would be very grateful if someone could give me some help or a hint.
After a short nap, I had an epiphany. I was a fool in the morning....
The workaround is easy to implement: Elasticsearch already overrides toString() well for us.
At first glance, I wouldn't suggest using regex right away. It looks like those lines follow some kind of pattern that you could parse and split up first.
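As a rough sketch of the parse-and-split idea, assuming the requests are separated by commas that sit outside every {...} and [...] pair, and that no brace or bracket appears inside a quoted value (both are assumptions about the input, and the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter: cuts the bulk string at top-level commas only.
class BulkRequestSplitter {
    static List<String> splitTopLevel(String bulk) {
        List<String> parts = new ArrayList<>();
        int depth = 0;   // current nesting level of {} and []
        int start = 0;   // index where the current request begins
        for (int i = 0; i < bulk.length(); i++) {
            char c = bulk.charAt(i);
            if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            } else if (c == ',' && depth == 0) {
                parts.add(bulk.substring(start, i).trim());
                start = i + 1;
            }
        }
        String tail = bulk.substring(start).trim();
        if (!tail.isEmpty()) {
            parts.add(tail);
        }
        return parts;
    }

    public static void main(String[] args) {
        String bulk = "index {[a][event][null], source[{\"haha\":\"haha\"}]}, "
                + "delete {[b][_doc][sjdos]}";
        for (String request : splitTopLevel(bulk)) {
            System.out.println(request);
        }
    }
}
```

Unlike the regex, this does not depend on lazy quantifiers stopping at the right closing brace, so nested index {...} entries inside an update stay in one piece.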
After that, if you're talking about regex, I'd try:
Taking a look at the java regex format: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
How about using an online java regex tool instead?

Remove Attributes by Name. Filter broken?

There is an attribute filter that should remove every attribute whose name matches a specified regular expression from a set of Instances.
I am having problems with the regex.
I tried several simple expressions, all of which are valid (tested on regexr).
But the filter does not seem to accept them.
The relevant code follows.
Instances dataset1_x=new Instances(dataset1);
RemoveByName filterX=new RemoveByName();
filterX.setInputFormat(dataset1_x);
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
// filterX.setExpression("^.*i$"); also doesn't work
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);
This should match all names ending with an "i".
The resulting dataset is named
"dataset-weka.filters.unsupervised.attribute.StringToNominal-Rlast-weka.filters.unsupervised.attribute.Remove-weka.filters.unsupervised.attribute.RemoveByName-E^.*id$"
Note that ^.*id$ is the default expression. It has not changed.
Yet filterX.getExpression() returns the regex that was set earlier.
This usage of the filter also matches several code examples.
The same happens if I set the regex using Filter.setOptions().
The issue occurs in both version 3.9.0 dev and 3.8 stable.
In the WEKA GUI, the filter works correctly.
So another possibility is that a regex entered programmatically must have a special format. Unfortunately, the API documentation does not provide examples.
You need to set the expression and the invertSelection flag before setting the input format.
More generally, I assume that you have to set all options before setting the input format.
The following works.
Instances dataset1_x=new Instances(dataset1);
RemoveByName filterX=new RemoveByName();
filterX.setInvertSelection(true);
filterX.setExpression(Pattern.quote("^.*i$"));
filterX.setInputFormat(dataset1_x);
Instances dataset1_=Filter.useFilter(dataset1_x,filterX);
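One thing worth double-checking in the snippets above is Pattern.quote. The sketch below uses only java.util.regex and String.matches; how RemoveByName applies the expression internally is not shown here, so treat it as an illustration of what quoting does to the expression, not of the filter itself. The class and method names are made up.

```java
import java.util.regex.Pattern;

// Hypothetical check: compares the raw expression with its quoted form.
class QuoteCheck {
    static boolean nameMatches(String expression, String attributeName) {
        return attributeName.matches(expression);
    }

    public static void main(String[] args) {
        String raw = "^.*i$";
        String quoted = Pattern.quote(raw); // turns the regex into a literal
        System.out.println(nameMatches(raw, "xi"));
        System.out.println(nameMatches(quoted, "xi"));
        System.out.println(nameMatches(quoted, "^.*i$"));
    }
}
```

Pattern.quote makes every character literal, so the quoted form only matches a name that is exactly the text ^.*i$. If the intent is "names ending in i", passing the unquoted expression is probably what you want.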

ReplaceText Usage in Apache NiFi

I am trying to make use of the out-of-the-box processor ReplaceText in Apache NiFi to search inside a .dsv file, match all datetime formats and convert them into dates. I am not sure, however, how to configure the processor itself. I have tried to set my search value (Search Value Property) to something like this:
(0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d)
My Replacement Value is regex1, which is set to ${time:format("yyyy-MM-dd'")}. I have also set up another property named time, which in turn is set to (0{0,1}[1-9])|(1/d)|(2/d)|(3[0-1])/(0{0,1}[1-9])|(1[0-2])/([1-9]/d):(0{0,1}/d)|(1/d)|(2[0-4]):(0{0,1}/d)|([1-5]/d)
This does not work and I have the feeling I am not using ReplaceText as it should be. Can you help?
EDIT:
I should have included that I am using the Replacement Strategy called Regex Replace and Evaluation Mode Entire text.
I believe a similar question was answered on the Apache mailing list, for reference:
I created a template [1] that shows an example of how to do the date conversion you described. It is linked to from the main templates page on the wiki [2] and is named "DateConversion.xml"
It first uses ExtractText to find the date string and extract it into an attribute called "date". The regular expression used is: (\d{2}-\d{2}-\d{4} \d{2}.\d{2}.\d{2})
Then it uses ReplaceText with the Search Value of the same regular expression above, to replace that with ${date:toDate("dd-MM-yyyy HH.mm.ss"):format("yyyy-MM-dd HH:mm:ss+0000")}
[1] https://cwiki.apache.org/confluence/download/attachments/57904847/DateConversion.xml?version=2&modificationDate=1462288576652&api=v2
[2] https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
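To make the find-then-reformat step concrete, here is a sketch in plain Java (not NiFi code) that does roughly what the ExtractText/ReplaceText pair above is described as doing: find dd-MM-yyyy HH.mm.ss timestamps and rewrite them as yyyy-MM-dd HH:mm:ss+0000. The class name is made up, the dots in the regex are escaped here (the template's regex leaves them unescaped), and "+0000" is appended as literal text with no time zone conversion.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the regex-and-format step; not a NiFi API.
class DateRewrite {
    private static final Pattern DATE =
            Pattern.compile("\\d{2}-\\d{2}-\\d{4} \\d{2}\\.\\d{2}\\.\\d{2}");
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd-MM-yyyy HH.mm.ss");
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss'+0000'");

    // Replaces every matching timestamp; throws DateTimeParseException
    // if a match is not a real date (e.g. month 99).
    static String rewrite(String text) {
        Matcher m = DATE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            LocalDateTime parsed = LocalDateTime.parse(m.group(), IN);
            m.appendReplacement(out, Matcher.quoteReplacement(OUT.format(parsed)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("id;25-12-2015 13.45.10;x"));
    }
}
```

The point is that the search regex only locates the text; the actual conversion is done by parsing with one format and printing with another, which is the role toDate(...) and format(...) play in the expression above.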

Detecting High Surrogates in a String using Regular Expressions

I want to check whether a String contains any High Surrogates. In Java I would use Character.isHighSurrogate(c) and this works.
In regex (using the implementation provided by Android 2.3.3 SDK), I was expecting this to work:
[\uD800-\uDBFF]
but it doesn't.
I am using the char: 𫘤 (codepoint: 177700) to test this (works in my java check but not the regex check).
Any ideas?
The regex engine looks at code points, not at code units. It has no choice, because this is a fundamental requirement of UTS#18 Level 1 Unicode support:
Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. (This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. The results of regular expression matching at this level are independent of country or language. At this level, the user of the regular expression engine would need to write more complicated regular expressions to do full Unicode processing.
And so this is true whether in the normal JDK regex engine, or in the Android regex engine that JNIs into the ICU regex library for much better Unicode support than the JDK provides. Amongst other things, ICU meets all Level-1 requirements and also some Level-2 requirements such as full properties (the upcoming 2.7), graphemes, and fancier boundaries. You don’t get to Level 1 before JDK7, and even there it lacks the rest of them. It is very hard to work with Unicode without grapheme support, and impossible without code-point support.
Sometimes you can get these things to find isolated surrogates, or reversed ones, but these are not supposed to occur in data valid for interchange.
In general, you want to stay as far away from any code-unit interface to anything as you possibly can, and use only those APIs that support a code-point interface instead. Code-units are a curse.
Also, stay very far away from the Java preprocessor. You’ll get no joy from your regexes that way. The ICU regex engine supports both \x{ᴄᴏᴅᴇ ᴘᴏɪɴᴛ} and \N{ᴄʜᴀʀɴᴀᴍᴇ}, so you should use those.
Why are you monkeying around with wicked-nasty code units, anyway? They violate the code-point abstraction.
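A small sketch of the two views, assuming java.util.regex on Java 7 or later (where \x{...} is accepted); on Android's ICU-backed engine the behaviour may differ, and the class and method names are made up. The regex asks the code-point question ("is there a character outside the BMP?"), while the loop asks the code-unit question from the original post.

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting code-point and code-unit checks.
class SupplementaryCheck {
    // Matches any code point from U+10000 to U+10FFFF.
    private static final Pattern SUPPLEMENTARY =
            Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    static boolean hasSupplementary(String s) {
        return SUPPLEMENTARY.matcher(s).find();
    }

    // Code-unit view: scans the UTF-16 chars for a high surrogate.
    static boolean hasHighSurrogateUnit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isHighSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String test = new String(Character.toChars(177700));
        System.out.println(hasSupplementary(test));
        System.out.println(hasHighSurrogateUnit(test));
    }
}
```

For well-formed text the two checks agree, since every supplementary code point is encoded as a surrogate pair; they only differ on strings that contain isolated surrogates.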
Looking at the documentation for Pattern, there is an example for matching Greek characters linking to Character.UnicodeBlock
Classes for Unicode blocks and categories
\p{InGreek} A character in the Greek block (simple block)
The constants available in that class include LOW_SURROGATES. Assuming the regex implementation on Android is compatible with the JDK one, I tried the following code:
String test = new String(Character.toChars(177700));
System.out.println(Pattern.compile("\\p{InLowSurrogates}").matcher(test).find());
System.out.println(Pattern.compile("\\p{InLOW_SURROGATES}").matcher(test).find());
Which prints "true" two times, meaning both naming styles work and it correctly detects low surrogates.
Strangely, the same code does not work for high surrogates, i.e. the following lines both print false:
System.out.println(Pattern.compile("\\p{InHighSurrogates}").matcher(test).find());
System.out.println(Pattern.compile("\\p{InHIGH_SURROGATES}").matcher(test).find());

Detecting words that start with an accented uppercase using regular expressions

I want to extract the words that begin with a capital — including accented capitals — using regular expressions in Java.
This is my conditional for words beginning with capital A through Z:
if (link.text().matches("^[A-Z].+") == true)
But I also want words that begin with an accented uppercase character, too.
Do you have any ideas?
Start with http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
\p{javaUpperCase} Equivalent to java.lang.Character.isUpperCase()
To match an uppercase letter at the beginning of the string, you need the pattern ^\p{Lu}.
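A minimal sketch of that pattern dropped into the conditional from the question, assuming the accented capital is a single precomposed character such as U+00C9. The class and method names are made up for the example.

```java
// Hypothetical helper: true when the text starts with an uppercase letter
// (general category Lu) followed by at least one more character.
class CapitalCheck {
    static boolean startsWithUppercase(String text) {
        return text.matches("^\\p{Lu}.+");
    }

    public static void main(String[] args) {
        System.out.println(startsWithUppercase("\u00C9cole")); // capital E with acute
        System.out.println(startsWithUppercase("\u00E9cole")); // small e with acute
        System.out.println(startsWithUppercase("Zebra"));
    }
}
```

In the original code this would replace the "^[A-Z].+" argument of link.text().matches(...).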
Unfortunately, Java does not support the mandatory \p{Uppercase} property, necessary for meeting UTS#18’s RL1.2.
That’s hardly the only thing missing from Java regular expressions to meet even Level 1, the most bare-bones Basic Unicode Functionality. Without Level 1, you really can’t work with Unicode text using regular expressions. Too much is broken or absent.
UTS#18’s RL1.1 will finally be met with JDK7, but I do not believe there are currently any plans to meet RL1.2, RL1.2a, or any of the others that it’s currently lacking, nor even meeting the two Strong Recommendations. Alas!
Indeed, of the very short list of mandatory properties required by RL1.2, Java is missing the \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{ANY}, and \p{ASSIGNED} properties. Those are all mandatory but either completely missing or else fail to obey The Unicode Standard with respect to their definitions. This is also the problem with the POSIX compatible properties in Java: they’re all broken with respect to UTS#18.
Prior to JDK7, it is also missing the mandatory Script properties. JDK7 does get script properties at long last, but that’s all — nothing else. Java is still light years away from meeting even RL1.2a, which is a daily gotcha for zillions of programmers.
In JDK7, you can finally also use two-part properties in the form \p{name=value} if they’re block, script, or general categories. That means these are all the same in JDK7’s Pattern class:
\p{Block=Number_Forms}, \p{blk=Number_Forms}, and \p{InNumber_Forms}.
\p{Script=Latin}, \p{sc=Latin}, \p{IsLatin}, and \p{Latin}.
\p{General_Category=Lu}, \p{GC=Lu}, and \p{Lu}.
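A quick way to check those equivalences yourself, assuming Java 7 or later. The sketch leaves out the bare \p{Latin} form, which is not verified here, and the class name and sample characters are chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical check: true when every pattern matches the whole input.
class PropertyForms {
    static boolean allMatch(String input, String... patterns) {
        for (String p : patterns) {
            if (!Pattern.matches(p, input)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+2160 ROMAN NUMERAL ONE lies in the Number Forms block.
        System.out.println(allMatch("\u2160",
                "\\p{Block=Number_Forms}", "\\p{blk=Number_Forms}", "\\p{InNumber_Forms}"));
        System.out.println(allMatch("a",
                "\\p{Script=Latin}", "\\p{sc=Latin}", "\\p{IsLatin}"));
        System.out.println(allMatch("A",
                "\\p{General_Category=Lu}", "\\p{GC=Lu}", "\\p{Lu}"));
    }
}
```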
However, you still cannot use the long forms like \p{Lowercase_Letter} and \p{Letter_Number}, and the POSIX-looking properties are all broken from RL1.2a’s perspective. Plus super-basic properties from RL1.2 like \p{White_Space} and \p{Alphabetic} are still missing.
There was some talk of trying to fix \b and \B, which are miserably broken with respect to \w and \W, but I don't know how they’re going to fix all that without fully complying with RL1.2a. And no, I have no idea when they will add those basic properties to Java. You can’t get by without them, either.
To fully work with Unicode using regexes in Java at even Level 1, you really cannot use the standard Pattern class that Java comes with. The easiest way to do so is to instead use JNI to connect up with ICU regex libraries using the Google Android code, which is available.
There do exist other languages that are at least Level-1 compliant (or better) with UTS#18, but if you want to stay within Java, ICU is currently your only real option.
Java has a method, java.lang.Character.isUpperCase. It's not exactly a regular expression, but it might be enough.
http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Character.html#isUpperCase(int)
