Freemarker regex is not matching on all lowercase substrings - java

So I am following the user guide, which seems straightforward, so I'm not sure what I am doing wrong. I want to use the matches built-in to find all lowercase words in a string. So, taking the example straight from the docs into my code (with some obvious changes), I always get the "Does not match" output. Any help is much appreciated:
<#assign res = "<UPPERCASE_WORD<lowercase_word>>"?matches("[a-z]+")>
<#if res>
Matches
<#else>
Does not match
</#if>
One difference I've noticed between my code and the docs is that the example has spaces and mine does not, but I doubt that's the issue, as a quick test with < > replaced by spaces shows no difference. I was thinking the regex is incorrect or not supported by Freemarker, but the docs link directly to Oracle's regex Pattern docs, so I think that's OK.

Don't use matches if you don't expect an exact match. From the docs:
This built-in determines if the string exactly matches the pattern
If you know the appropriate exact regex, use it.
For example, for lowercase letters followed by uppercase letters, use:
?matches("[a-z]+[A-Z]+")

If you want to check if the string contains [a-z] somewhere, then the regular expression should be ".*[a-z]+.*", because ?matches checks if the pattern matches the whole string.
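The same whole-string versus substring distinction can be reproduced in plain Java. This is only a sketch, assuming ?matches behaves like Matcher.matches(), as the quoted documentation describes; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // True only if the regex consumes the entire input (like Matcher.matches()).
    static boolean wholeMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches anywhere inside the input (like Matcher.find()).
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String s = "<UPPERCASE_WORD<lowercase_word>>";
        System.out.println(wholeMatch(s, "[a-z]+"));       // whole string must be lowercase letters
        System.out.println(containsMatch(s, "[a-z]+"));    // a lowercase run somewhere is enough
        System.out.println(wholeMatch(s, ".*[a-z]+.*"));   // padded pattern covers the whole string
    }
}
```

Padding the pattern with .* on both sides turns a whole-string check into a "contains" check, which is what the question was after.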

Regular expression to return results that do not match selection

I work on a product that provides a Java API to extend it.
The API provides a function which
takes a Perl regular expression and
returns a list of matching files.
I want to filter the list to remove all files that end in .xml, .xsl and .cfg; basically the opposite of .*(\.xml|\.xsl|\.cfg).
I have been searching but I haven't been able to get anything to work yet.
I tried .*(?!\.cfg) and ^((?!cfg).)*$ and \.(?!cfg$|?!xml$|?!xsl$).
I don't know if I am on the right track or not.
Note
I know the regex systems are similar, but I can't get a Java regex working either.
You may use
^(?!.*\.(x[ms]l|cfg)$).+
See the regex demo
Details:
^ - start of a string
(?!.*\.(x[ms]l|cfg)$) - a negative lookahead that fails the match if any 0+ chars other than line break chars (.*) are followed with xml, xsl or cfg ((x[ms]l|cfg)) at the end of the string ($)
.+ - any 1 or more chars other than linebreak chars. Might be omitted if the entire string match is not required (in some tools it is required though).
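For the Java side of the question, the lookahead pattern above could be applied as in the following sketch. The class name, method name and sample file names are hypothetical; the product API from the question is not shown because its signature is not given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ExtensionFilter {
    // Matches any non-empty name that does NOT end in .xml, .xsl or .cfg.
    private static final Pattern KEEP = Pattern.compile("^(?!.*\\.(x[ms]l|cfg)$).+");

    // Returns only the names that pass the negative lookahead.
    static List<String> keepOthers(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (KEEP.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.xml", "b.xsl", "c.cfg", "d.txt", "xml.java");
        System.out.println(keepOthers(files));
    }
}
```

Note that a name such as xml.java is kept, because the lookahead only rejects the unwanted extensions at the very end of the string.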
You need something like this, which matches only if the end of the string isn't preceded by a dot and one of the three unwanted types
/(?<!\.(?:xml|xsl|cfg))\z/

Regex not matching when the start or end are empty

Here is my regex as I have inputted it into my java file.
String myRegex = "(?<=[^a-zA-Z0-9])(target)(?=[^a-zA-Z0-9])";
If I have a string as follows:
.target. - it works.
However, if I have a string that JUST says target it does not work. How can I modify the regex so that if there is nothing at the start or the end of the string, it still matches?
EDIT - Examples.
_target - Should succeed!
target_ - Should succeed!
target - Should succeed!
Currently these examples fail with the current regex.
Add "start of input" to your look behind and add "end of input" to your look ahead using a regex alternation (ie | which is a logical "or"):
String myRegex = "(?<=^|[^a-zA-Z0-9])target(?=[^a-zA-Z0-9]|$)";
The problem with your regex is that your look behind required there to be a preceding character that was not a letter/digit.
These look arounds also match start/end of input.
See live demo.
The problem is that there are two negations in play here: a lookbehind can be positive or negative, and a character class can be negated or not. Currently, my lookbehinds are positive and my character classes are negated. So the regex says: "Look behind and make sure you find something that is not in this class". When there is nothing there, it can't find such a character and fails. The solution was to make the lookarounds negative and the character classes positive. Now it says: "Look behind and make sure there ISN'T one of these characters". If there is nothing there, it doesn't fail, because the condition is met.
This is the final regex:
String myRegex = "(?<![a-zA-Z0-9])target(?![a-zA-Z0-9])";
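A quick way to check the negative-lookaround version against the examples from the question is sketched below; the class and method names are illustrative only.

```java
import java.util.regex.Pattern;

class TargetBoundaryDemo {
    // "target" not preceded and not followed by a letter or digit.
    private static final Pattern TARGET =
            Pattern.compile("(?<![a-zA-Z0-9])target(?![a-zA-Z0-9])");

    static boolean containsTarget(String input) {
        return TARGET.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] samples = {".target.", "_target", "target_", "target", "atarget", "targets"};
        for (String s : samples) {
            System.out.println(s + " -> " + containsTarget(s));
        }
    }
}
```

The negative lookarounds succeed at the start and end of the input because there is simply no alphanumeric character there to reject.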
If I'm understanding your question correctly, instead of using the look ahead and look behind, you can just use the ? to indicate that there should be 0 or 1 non-alphabetical or numerical character before and after "target".
([^a-zA-Z0-9])?(target)([^a-zA-Z0-9])?
You should be able to match target by putting the * (0 or more) quantifier on the characters you want to allow around it. So:
[_]*(target)[_]*
should match:
_target
target
target_
_target_
Add any element you want to be matched before or after the word to the brackets. Example to match .target. too:
[\._]*(target)[\._]*
This will match the target substring no matter where it is in the string. If you want the rule to match only at the start of the string, add the ^ anchor:
^[\._]*(target)[\._]*
and it will match the strings mentioned above only if they start the string.

RegEx - match the whole <a> tag in java

I'm trying to match this <a href="**something**"> using regex in java using this code:
Pattern regex = Pattern.compile("<([a-z]+) *[^/]*?>");
Matcher matcher = regex.matcher(string);
string= matcher.replaceAll("");
I'm not really familiar with regex. What am I doing wrong? Thanks
If you just want to find the start tag you could use:
"<a(?=[>\\s])[^>]*>"
If you are trying to get the href attribute it would be better to use:
"<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>"
This would capture the link into capturing group 2.
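As a rough illustration of reading capturing group 2, here is a small sketch. The class name, method name and sample HTML are invented for the example, and the usual caveats about parsing HTML with regular expressions still apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Group 1 is the opening quote, group 2 the href value, \1 the matching closing quote.
    private static final Pattern A_TAG =
            Pattern.compile("<a\\s+[^>]*href=(['\"])(.*?)\\1[^>]*>");

    // Collects the href value of every <a ...> start tag found in the input.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = A_TAG.matcher(html);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.com/page\" class=\"x\">one</a> "
                + "<a class='y' href='/home'>two</a></p>";
        System.out.println(hrefs(html));
    }
}
```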
To give you an idea of why people always say "don't try to parse HTML with a regular expression", here's a simplified regex for matching an <a> tag:
<\s*a(?:\s+[a-z]+(?:\s*=\s*(?:[a-z0-9]+|"[^"]*"|'[^']*'))?)*\s*>
It actually is possible to match a tag with a regular expression. It just isn't as easy as most people expect.
All of HTML, on the other hand, is not "regular" and so you can't do it with a regular expression. (The "regex" support in many/most languages is actually more powerful than "regular", but few are powerful enough to deal with balanced structures like those in HTML.)
Here's a breakdown of what the above expression does:
<\s*                  < and possibly some spaces
a                     "a"
(?:                   0 or more...
  \s+                 some spaces
  [a-z]+              attribute name (simplified)
  (?:                 and maybe...
    \s*=\s*           an equal sign, possibly with surrounding spaces
    (?:               and one of:
      [a-z0-9]+       a simple attribute value (simplified)
      |"[^"]*"        a double-quoted attr value
      |'[^']*'        a single-quoted attr value
    )
  )?
)*
\s*>                  possibly more spaces and then >
(The comments at the start of each group also talk about the operator at the end of the group, or even in the group.)
There are possibly other simplifications here -- I wrote this from memory, not from the spec. Even if you follow the spec to the letter, browsers are even more fault tolerant and will accept all sorts of invalid input.
You can just match against:
"<a[^>]*>"
The * quantifier is greedy in Java, so this works.
But you cannot match < a whatever="foo" > with that, because of the whitespace after the <.
The following is better, although harder to read:
"<\\s*a\\s+[^>]*>"
(The double \\ is needed because \ is a special character in Java string literals.)
This handles optional whitespace before a and requires at least one whitespace character after a.
So you don't match <abcdef>, which is not a correct a tag.
(I assume your a tag stands isolated on one line and you are not working with multiline mode enabled. Otherwise it gets far more complicated.)
The last part of your pattern, *[^/]*?>, seems a bit strange; maybe that is why it doesn't work.
OK, let's check what you are doing:
<([a-z]+) *[^/]*?>
<([a-z]+)
This matches a < followed by one or more lowercase letters. The letters are captured as a group by the parentheses.
The * after the space applies to the space only, so it means zero or more spaces after the tag name.
[^/]*
This matches any run of characters other than /, possibly empty (because of the *).
The question mark after the * makes the quantifier lazy, so it matches as few characters as possible before the >.
The final > must appear as the last character of the match.
To sum up, your expression matches any tag name rather than just a, and it fails as soon as the tag contains a /, which is usually the case for a URL in href.
Take a look at: http://www.regular-expressions.info/
This is a good starting point.

java regex: match input starting with non-number or empty string followed by specific pattern

I'm using Java regular expressions to match and capture a string such as:
0::10000
A solution would be:
(0::\d{1,8})
However, the match would succeed for the input
10::10000
as well, which is wrong. Therefore, I now have:
[^\d](0::\d{1,8})
which means it must lead with any character except a number, but that means there needs to be some character before the first zero. What I really want (and what I need help with) is to say "lead with a non-number or nothing at all."
In conclusion the final solution regular expression should match the following:
0::10000kjkj0::10000
and should not match the following:
10::10000
This site may be of use if someone wants to help.
Thanks.
You need a negative lookbehind:
(?<!\d)(0::\d{1,8})
It means "match 0::\d{1,8} not preceded by \d".
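The lookbehind can be checked against the inputs from the question with a sketch like the following; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingDigitDemo {
    // "0::" plus 1 to 8 digits, provided the leading 0 is not preceded by another digit.
    private static final Pattern P = Pattern.compile("(?<!\\d)(0::\\d{1,8})");

    // Returns every captured occurrence in the input, in order.
    static List<String> findAll(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("0::10000"));
        System.out.println(findAll("0::10000kjkj0::10000"));
        System.out.println(findAll("10::10000"));
    }
}
```

Because a lookbehind consumes no characters, it also succeeds at the very start of the input, which covers the "nothing at all" case.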

How to make negative lookahead work with end of line text

I have a regex like the following:
.{0,1000}(?!(xa7|para(graf))$)
using Java.
I was expecting that it would cause the following text to fail:
blaparagraf
because paragraf is found at the end
That's because .{0,1000} will match the entire subject, hence it's not followed by xa7 or paragraf (it's followed by $ only).
You want negative lookbehind:
.{0,1000}(?<!xa7|paragraf)$
It is a common mistake to misplace assertions. If you want to use lookahead, the pattern is something like this:
^(?!.*paragraph$).*$
This matches (as seen on rubular.com):
something something para
paragraph something something
But doesn't match:
something paragraph
So the key difference here is that we start looking ahead at the beginning of the string, before we match .* (or .{0,1000} in your case). Of course, what we're looking for isn't simply paragraph$, but rather .*paragraph$.
That said, to check that a string doesn't end with something of finite length, lookbehind when supported is the most natural solution.
^.*$(?<!paragraph)
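To compare the three placements side by side in Java, here is a small sketch using the original patterns from the question and the first answer. The LOOKAHEAD constant is an adaptation of the lookahead form to the question's xa7/paragraf alternatives, and the class and method names are made up.

```java
import java.util.regex.Pattern;

class EndAssertionDemo {
    // Original attempt: the lookahead is only tested after .{0,1000} has consumed the text.
    static final String MISPLACED = ".{0,1000}(?!(xa7|para(graf))$)";
    // Lookbehind checked at the end of the string.
    static final String LOOKBEHIND = ".{0,1000}(?<!xa7|paragraf)$";
    // Lookahead checked at the start, before anything is consumed (adapted pattern).
    static final String LOOKAHEAD = "^(?!.*(xa7|paragraf)$).*$";

    // Whole-string match, as with String.matches().
    static boolean accepts(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"blaparagraf", "paragraf bla"}) {
            System.out.println(input
                    + " misplaced=" + accepts(MISPLACED, input)
                    + " lookbehind=" + accepts(LOOKBEHIND, input)
                    + " lookahead=" + accepts(LOOKAHEAD, input));
        }
    }
}
```

The misplaced lookahead accepts blaparagraf because it is evaluated at the end of the string, where nothing is left to look at; the other two reject it.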
