How to add a negation to \p{S} - java

Hi, someone offered me a great solution that uses \p{S} to match any symbol in a regex. The problem is that I need to exclude two symbols from \p{S}: I don't want & or ' to match.
I thought maybe \p{S^&^'} would work but it doesn't. I have looked online but I am not really sure what to search for.
Please help.
\b\p{L}*[\p{S}\p{P}]((\p{L}[\p{P}\p{S}])|([\p{P}\p{S}]\p{L})|(\p{L}))+\b
The other solution is \b([a-zA-Z]+(?:[^\w\s^'&]|_)[a-zA-Z]*)|[a-zA-Z]*(?:[^\w\s^'&]|_)[a-zA-Z]+\b but it catches words ending in punctuation. If it didn't do that it would work also.

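Use character class subtraction (if available):
[\p{S}-[&']]
Java does not support that syntax, but its intersection operator does the same job:
[\p{S}&&[^&']]
Alternatively, use a negative lookahead that checks only the next character:
(?![&'])\p{S}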
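A minimal Java sketch of both approaches, assuming the goal is the combined class [\p{S}\p{P}] from the question minus & and '. Note that Unicode classifies both & and ' as punctuation (\p{P}) rather than symbols (\p{S}), so the exclusion only has an effect on the combined class. The class name SymbolExclusion and the sample string are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SymbolExclusion {
    // Symbols or punctuation, intersected with "anything except & and '".
    static final Pattern INTERSECTION = Pattern.compile("[\\p{S}\\p{P}&&[^&']]");

    // Same idea with a lookahead that inspects only the next character.
    static final Pattern LOOKAHEAD = Pattern.compile("(?![&'])[\\p{S}\\p{P}]");

    // Collects every match of the pattern in the input, in order.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String sample = "a+b & c's $5!";
        System.out.println(findAll(INTERSECTION, sample)); // [+, $, !]
        System.out.println(findAll(LOOKAHEAD, sample));    // [+, $, !]
    }
}
```

Either character class can be dropped into the longer word-matching pattern wherever [\p{S}\p{P}] appears.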

You could use a negative look-ahead, (?!.*[&']), which rejects the match if & or ' appears anywhere later on the line, i.e.
\b(?!.*[&'])\p{L}*[\p{S}\p{P}]((\p{L}[\p{P}\p{S}])|([\p{P}\p{S}]\p{L})|(\p{L}))+\b

Related

regular expression to limit 300 words

I need a regex that succeeds for 0-300 words and fails for 301 or more words.
I tried:
^\s*(\S+\s+){0,300}\S*$
I also checked
^\W*(?:\w+\b\W*){0,300}$
Both work fine, but in Java I get a java.lang.StackOverflowError. I know I can get around this with a larger thread stack size (-Xss), but I wanted to ask whether there is a way to optimize the regex.
I believe the problem is that the Java implementation of Pattern uses up stack for each repetition of a group because of backtracking. The solution might be either to change your approach, as others have answered, or to make all quantifiers possessive:
^\s*(\S+\s+){0,300}+\S*$
or
^\W*(?:\w+\b\W*){0,300}+$
For more info, see here or here.
You can use String.split and check the size of the returned array.
Your regex will fail if the 300th word is the last word and there is no space after it. You should use
^ *(?:\S+(?: +|$)){0,300} *$
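The String.split suggestion above avoids the deep regex recursion altogether. A minimal sketch, assuming a "word" is any run of non-whitespace characters; the class and method names are hypothetical:

```java
final class WordLimit {
    // Counts whitespace-separated words; empty or blank input has zero words.
    static int countWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0; // "".split(...) would return one empty element
        }
        return trimmed.split("\\s+").length;
    }

    // True when the text contains at most maxWords words.
    static boolean withinLimit(String text, int maxWords) {
        return countWords(text) <= maxWords;
    }

    public static void main(String[] args) {
        System.out.println(withinLimit("one two  three", 300));
    }
}
```

The regex passed to split is applied one separator at a time, so it does not nest one repetition per word the way the {0,300} group does.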

Limited currency regex

I have found a lot of good currency regular expressions that get very close to what I need. Alas, I am no regex guru and can't seem to edit my current regex to meet requirements.
I need to limit the valid inputs to the format of 'xxx,xxx.xx'. The max allowed amount needs to be '999,999.99' with commas optional. I've been using this regex until now:
^([0-9]{1,3}(,[0-9]{3})*|([0-9]+))(.[0-9]{2})?$
It has been working great except for not being able to make the upper limit '999,999.99'. Thanks for the help!
Update
I've been tinkering and I've managed to come up with this:
/^(?:([0-9]{3}?,?)?[0-9]{3}(?:\.[0-9]?[0-9]?)?)$/
Still testing to see if it works. RegexPlanet isn't passing it with any of the Strings I try, but I'll be going through my app and manually testing.
burning_LEGION's answer allows some cases I think you probably don't want:
- 999,9
- 9.
I'll assume you want these conditions fulfilled:
- if there is a comma, there are 3 digits after it
- if there is a point, there are 2 digits after it
^\d{1,3}(,?\d{3})?(\.\d{2})?$
Use this regex: ^\d{1,3}(,?\d{1,3}){0,1}(\.\d{0,2})?$
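To compare candidate patterns, a quick Java check of the stricter regex ^\d{1,3}(,?\d{3})?(\.\d{2})? against the edge cases mentioned above may help. This is only a sketch; the class name CurrencyCheck and the sample inputs are made up.

```java
import java.util.regex.Pattern;

final class CurrencyCheck {
    // Up to 999,999.99: 1-3 digits, optionally one more group of exactly
    // three digits (comma optional), optionally a point and exactly two digits.
    static final Pattern AMOUNT =
            Pattern.compile("^\\d{1,3}(,?\\d{3})?(\\.\\d{2})?$");

    static boolean isValid(String input) {
        return AMOUNT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"999,999.99", "999999.99", "1", "999,9", "9.", "1,000,000"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The first three samples are accepted and the last three are rejected, which matches the two conditions listed above.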

Java regular expression for a digit followed by z^3?

I want to check if a string matches the form az^3 where a is any integer.
I've tried the following:
str.matches("\\d* z^3")
str.matches("\\d* z\^3")
str.matches("^(\\d* z^3)$")
str.matches("^(\\d* z\^3)$")
str.matches("\\d* (z^3)")
str.matches("\\d* (z\^3)")
This is driving me crazy. :-(
I've tried every possible regex tutorial and searched for examples and I still can't even come up with a solution.
I'd really appreciate if anyone can help me.
You need to escape the caret, and in a Java string literal the backslash itself has to be doubled:
str.matches("\\d+z\\^3");
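A small sketch showing why the escape matters: an unescaped ^ in the middle of a pattern is a start-of-input anchor, so it can never match there. The class and method names are hypothetical, and the pattern assumes there is no space between the number and z.

```java
final class CubicTerm {
    // One or more digits, a literal z, a literal caret, then 3.
    static boolean isCubicTerm(String s) {
        return s.matches("\\d+z\\^3");
    }

    public static void main(String[] args) {
        System.out.println(isCubicTerm("5z^3"));            // true
        System.out.println(isCubicTerm("z^3"));             // false: \d+ needs a digit
        System.out.println("5z^3".matches("\\d+z^3"));      // false: ^ is an anchor here
    }
}
```

If a missing coefficient or a leading minus sign should also be accepted, the \d+ part would need to change (for example to -?\d* ), which is an assumption beyond what the question states.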

With regex, is using both "is" and "is not" range definitions within the same range possible?

Note: I'm using a 3rd party app that uses regex for searches which has its own flavor but almost always works like java's flavor of regex. Of course this may not matter.
After searching for many different ways of this same question (phrased many ways), I did not see any tutorials, examples, or even mentions of whether it is possible to use both an "is" (positive?) and "is not" (negative?) definition within the same range.
I can't test the examples in the app right now to see if my ideas work, because the amount of data being searched is massive and a test would screw up the matches it has already gathered. That is the only reason I'm asking.
Here are examples of what I thought might work but caused tester to act weird:
[\w^\s<>.!?]{2}
[\w|^\s<>.!?]{2}
I would rather have it work the way I think the first one would work (any digit, lower case or upper case character, or other normal character that is not a space, >, <, period, !, or ?) rather than the second, which only has an or operator.
The regex testers I used gave me different funky results which is what is confusing me.
Also note: I'm using this within a capture group which is followed by a catch-everything match, which I may or may not be using properly. So if you'd like to include how to follow what I'm attempting with how to properly do that, feel free. I AM MAINLY JUST CURIOUS WHETHER THIS IS POSSIBLE OR NOT, OR WHETHER IT IS AN IMPROPER METHOD.
Why do you need the \w at all?
[^\s<>.!?]{2}
This already matches all alphanumeric characters since they are neither space nor any of the punctuation characters you mentioned.
In general, you can subtract character classes to some degree. For example, to match alphanumerics excluding digits, you can do
[^\W\d]
because [^\W] matches the same as \w, and \d is subtracted from that because it's in a negated character class.
Edit:
Some regex engines (like XPath, .NET and JGSoft) allow flexible character class subtraction like this:
[a-z-[e-g]]
to match any character from the range [a-z], excluding e, f and g. Java does not support this syntax, but it gives the same effect through intersection: [a-z&&[^e-g]].
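A short Java sketch of these points, assuming Java's regex flavor (the 3rd party app may differ). It also shows why the first attempt from the question acts "weird": a ^ that is not the first character inside [...] is just a literal caret, so [\w^\s<>.!?] is a plain union that still matches a space. The class and method names are hypothetical.

```java
final class CharClassDemo {
    // Word characters minus digits, via a negated class.
    static boolean isWordNonDigit(String s) {
        return s.matches("[^\\W\\d]");
    }

    // a-z minus e, f, g, using Java's && intersection operator.
    static boolean isAtoZExceptEtoG(String s) {
        return s.matches("[a-z&&[^e-g]]");
    }

    // The class from the question: ^ is a literal here, not a negation.
    static boolean matchesQuestionClass(String s) {
        return s.matches("[\\w^\\s<>.!?]");
    }

    public static void main(String[] args) {
        System.out.println(isWordNonDigit("a") + " " + isWordNonDigit("7"));
        System.out.println(isAtoZExceptEtoG("d") + " " + isAtoZExceptEtoG("f"));
        System.out.println(matchesQuestionClass(" ") + " " + matchesQuestionClass("^"));
    }
}
```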
Another possibility is to use two ranges and combine them; e.g.
([\w]|[^\s<>.!?]){2}
However, this does bring up the question of what you are actually trying to express here. Because this example (as I've rewritten it) doesn't make a lot of sense.
What it says is "a word character, or any character that is not whitespace or certain punctuation". But the class of characters that are not "whitespace or certain punctuation" ALREADY includes all of the word characters. So, unless you mean something different, the \w is redundant.
From your question, it looks like a no-space regex would match your needs, you can achieve that with:
[\S]{2}

Java Regex, capturing groups with comma separated values

InputString: A soldier may have bruises , wounds , marks , dislocations or other Injuries that hurt him .
ExpectedOutput:
bruises
wounds
marks
dislocations
Injuries
Generalized Pattern Tried:
".[\s]?(\w+?)"+ // bruises.
"(?:(\s)?,(\s)?(\w+?))*"+ // wounds marks dislocations
"[\s]?(?:or|and) other (\w+)."; // Injuries
The pattern should be able to match other input strings like: A soldier may have bruiser or other injuries that hurt him.
On trying the generalized pattern above, the output is:
bruises
dislocations
Injuries
There is something wrong with the capturing group for "(?:(\s)?,(\s)?(\w+?))*". The capturing group has more than one occurrence, but it returns only "dislocations". "wounds" and "marks" are devoured.
Could you please suggest what should be the right pattern, and where is the mistake?
This question comes closest to this question, but that solution didn't help.
Thanks.
When the capture group is annotated with a quantifier [i.e. (foo)*] then you will only get the last match. If you want to get all of them, you need to put the quantifier inside the capture and then manually parse out the values. As big a fan as I am of regex, I don't think it's appropriate here for any number of reasons... even if you weren't ultimately doing NLP.
How to fix: (?:(\s)?,(\s)?(\w+?))*
Well, the quantifier basically covers the whole regex in that case and you might as well use Matcher.find() to step through each match. Also, I'm curious why you have capture groups for the whitespace. If all you are trying to do is find a comma-separated set of words then that's something like: \w+(?:\s*,\s*\w+)* Then don't bother with capture groups and just split the whole match.
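A rough Java sketch of the "match the whole list, then split it" idea. It assumes the list is always followed by "or other X" or "and other X", as in the asker's pattern; the combined regex, the class name and the method name are illustrative, not a general solution.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListExtractor {
    // Group 1: the whole comma-separated list; group 2: the word after "other".
    static final Pattern LIST = Pattern.compile(
            "(\\w+(?:\\s*,\\s*\\w+)*)\\s+(?:or|and)\\s+other\\s+(\\w+)");

    static List<String> extract(String sentence) {
        List<String> items = new ArrayList<>();
        Matcher m = LIST.matcher(sentence);
        if (m.find()) {
            // The quantifier sits inside group 1, so the list is split by hand.
            items.addAll(Arrays.asList(m.group(1).split("\\s*,\\s*")));
            items.add(m.group(2));
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "A soldier may have bruises , wounds , marks , dislocations "
                + "or other Injuries that hurt him ."));
    }
}
```

For the sample sentence this yields bruises, wounds, marks, dislocations and Injuries, because no captured group is repeated and overwritten.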
And for anything more complicated re: NLP, GATE is a pretty powerful tool. The learning curve is steep at times but you have a whole industry of science-guys to draw from: http://gate.ac.uk/
Regex is not suited for (natural) language processing. With regex, you can only match well-defined patterns. You should really, really abandon the idea of doing this with regex.
You may want to start a new question where you specify what programming language you're using to perform this task and ask for pointers there.
EDIT
PSpeed posted a promising link to a 3rd party library, Gate, that's able to do many language processing tasks. And it's written in Java. I have not used it myself, but looking at the people/institutions working on it, it seems pretty solid.
The pattern that works is: \w+(?:\s*,\s*\w+)* and then manually separate CSV
There is no other method to do this with Java Regex.
Ideally, Java regex is not suitable for NLP. A useful tool for text mining is: gate.ac.uk
Thanks to Bart K. , and PSpeed.
