regular expression to limit 300 words - java

I need a regex which success 0-300 words and fails 301 or more words.
I tried:
^\s*(\S+\s+){0,300}\S*$
I also checked
^\W*(?:\w+\b\W*){0,300}$
Both are working fine but in Java I get a java.lang.StackOverflowError. I know using a larger "XSS" I get around this issue but I wanted to ask if there is a way to optimize the regex?

I believe the problem is because the Java implementation of Pattern uses up a stack for each repetition of a group because of backtracking. The solution might be to either change your approach as others have answered or to make all quantifiers possessive :
^\s*(\S+\s+){0,300}+\S*$
or
^\W*(?:\w+\b\W*){0,300}+$
For more info, see here or here.

You can use String.split and check the size of the returned array.

Your regex will fail if the 300th word is the last word and there is notspace ahead of it.You should use
^ *(?:\S+(?: +|$)){0,300} *$

Related

Java regex to filter strings starting with a pattern and end after first ] appearance [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Extracting multiple labels from a string with java regex [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?
Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negative character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means “everything except a whitespace and this is exactly what you want.
Well, ".*" is a greedy selector. You make it non-greedy by using ".*?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ".*?". This means that if for instance nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, #"Project name: (?<name>.*?) J\d+").Groups["name"].Value;
I would also recommend you experiment with regular expressions using "Expresso" - it's a utility a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people unexprienced with regex might not be familiar with, in a way that it would be easy for them to learn these new concepts.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm
(Project name:\s+[A-Z]:(?:\\w+)+.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Adding (?:\\w+)+.[a-zA-Z]+ will be more restrictive instead of .*

Limited currency regex

I have found a lot of good currency regular expressions that get very close to what I need. Alas, I am no regex guru and can't seem to edit my current regex to meet requirements.
I need to limit the valid inputs to the format of 'xxx,xxx.xx'. The max allowed amount needs to be '999,999.99' with commas optional. I've been using this regex until now:
^([0-9]{1,3}(,[0-9]{3})*|([0-9]+))(.[0-9]{2})?$
It has been working great except for not being able to make the upper limit '999,999.99'. Thanks for the help!
Update
I've been tinkering and I've managed to come up with this:
/^(?:([0-9]{3}?,?)?[0-9]{3}(?:\.[0-9]?[0-9]?)?)$/
Still testing to see if it works. RegexPlanet isn't passing it with any of the Strings I try, but I'll be going through my app and manually testing.
burning_LEGION's answer authorizes some cases I think you probably don't want:
- 999,9
- 9.
I'll assume you want those conditions fulfilled:
- if there is a comma, there are 3 numbers after
- if there is a point, there are 2 numbers after
^\d{1,3}(,?\d{3})?(\.\d{2})?$
use this regex ^\d{1,3}(,?\d{1,3}){0,1}(\.\d{0,2})?$

How to add a negation to \p{S}

Hi someone offered me a great solution that included \p{S} to match any symbol in a regex. The problem is that I need to eliminate two symbols from the \p{S}. I don't want & or ' to be a match.
I thought maybe \p{S^&^'} would work but it doesn't. I have looked online but I am not really sure what to search for.
Please help.
\b\p{L}*[\p{S}\p{P}]((\p{L}[\p{P}\p{S}])|([\p{P}\p{S}]\p{L})|(\p{L}))+\b
The other solution is \b([a-zA-Z]+(?:[^\w\s^'&]|_)[a-zA-Z]*)|[a-zA-Z]*(?:[^\w\s^'&]|_)[a-zA-Z]+\b but it catches words ending in punctuation. If it didn't do that it would work also.
Use character class subtraction (if available):
[\p{S}-[&']]
If not available, use a lookahead:
(?!.?[&'])\p{S}
You could use a negative look-ahead (?!.*[&']), i.e.
\b(?!.*[&'])\p{L}*[\p{S}\p{P}]((\p{L}[\p{P}\p{S}])|([\p{P}\p{S}]\p{L})|(\p{L}))+\b

Android: Matcher.find() never returns

First of all, here is a chunk of affected code:
// (somewhere above, data is initialized as a String with a value)
Pattern detailsPattern = Pattern.compile("**this is a valid regex, omitted due to length**", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher detailsMatcher = detailsPattern.matcher(data);
Log.i("Scraper", "Initialized pattern and matcher, data length "+data.length());
boolean found = detailsMatcher.find();
Log.i("Scraper", "Found? "+((found)?"yep":"nope"));
I omitted the regex inside Pattern.compile because it's very long, but I know it works with the given data set; or if it doesn't, it shoudn't break anything anyway.
The trouble is, I do get the feedback I/Scraper(23773): Initialized pattern and matcher, data length 18861 but I never see the "Found?" line, it is just stuck on the find() call.
Is this a known Android bug? I've tried it over and over and just can't get it to work. Somehow, I think something over the past few days broke this because my app was working fine before, and I have in the past couple days received several comments of the app not working so it is clearly affecting other users as well.
How can I further debug this?
Some regexes can take a very, very long time to evaluate. In particular, regexes that have lots of quantifiers can cause the regex engine to do a huge amount of backtracking to explore all of the possible ways that the input string might match. And if it is going to fail, it has to explore all of those possibilities.
(Here is an example:
regex = "a*a*a*a*a*a*b"; // 6 quantifiers
input = "aaaaaaaaaaaaaaaaaaaa"; // 20 characters
A typical regex engine will do in the region of 20^6 character comparisons before deciding that the input string does not match.)
If you showed us the regex and the string you are trying to match, we could give a better diagnosis, and possibly offer some alternatives. But if you are trying to extract information from HTML, then the best solution is to not use regexes at all. There are HTML parsers that are specifically designed to deal with real-world HTML.
How long is the string you are trying to parse ?
How long and how complicated is the regex you are trying to match ?
Have you tried to break down your regex down to simpler bits ? Adding up the bits one after another will let you see when it breaks and maybe why.
make some RE like [a-zA-Z]* pass it as argument to compile(),here this example allows only characters small & cap.
Read my blogpost on android validation for more info.
I had the same issue and I solved it replacing all the wildchart . with [\s\S]. I really don't know why it worked for me but it did. I come from Javascript world and I know in there that expression is faster for being evaluated.

Categories