Regex exceptions to introduce in array of values

I have created a regular expression like this:
^[0-9][0-9][A-Z][A-Z][a-z]_([0-9]{1,10})_([0-9]{1,11})_([0-9]{1,11})$
It should match values ranging from 01BRa_1_1_1 to 99BRz_9999999999_99999999999_99999999999.
My problem is that I need to exclude the value 0 from each of the three _number fields, so that each one starts from 1.
I have been trying different expressions but can't find the right one.
If someone knows how to solve this, help would be appreciated. Thanks.
The goal is to eliminate 0_0_0, 00_00_00, 000_000_000 and every case where 0 is the first digit, so that the first valid combination is 1_1_1 for those three fields.
I am using this in Java (in reply to one comment), though I don't see how that is relevant; this is just a Pattern.
Resolved with this:
^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$

If your goal is to eliminate values equal to 0 (0, 00, 000, etc.), then an expression like this might work:
^[0-9][0-9][A-Z][A-Z][a-z]_(?!0+_)([0-9]{1,10})_(?!0+_)([0-9]{1,11})_(?!0+$)([0-9]{1,11})$
Of course, this will depend on your regex engine supporting variable-length zero-width assertions (aka "lookahead"). It would help to know which flavor you are using. (From the regex tooltip: "Please also include a tag specifying the programming language or tool you are using.")
If your goal is to eliminate anything starting with 0 (0, 01, 001, etc.), then an expression like this might work (the first field stays capped at 10 digits, as in your original):
^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$
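To compare the two variants side by side, here is a minimal Java sketch. It assumes java.util.regex and whole-string matching via Matcher.matches(); the class name NonZeroFields and the sample strings are made up for illustration. The leading-zero variant caps the first field at 10 digits, in line with the {1,10} in the question.

```java
import java.util.regex.Pattern;

class NonZeroFields {
    // Rejects a field only when it consists entirely of zeros (0, 00, 000, ...).
    static final Pattern NO_ALL_ZERO = Pattern.compile(
        "^[0-9][0-9][A-Z][A-Z][a-z]_(?!0+_)([0-9]{1,10})_(?!0+_)([0-9]{1,11})_(?!0+$)([0-9]{1,11})$");

    // Rejects any field that starts with a zero (0, 01, 001, ...).
    static final Pattern NO_LEADING_ZERO = Pattern.compile(
        "^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$");

    public static void main(String[] args) {
        // Illustrative inputs only.
        String[] samples = {"01BRa_1_1_1", "01BRa_0_1_1", "01BRa_00_1_1", "01BRa_01_1_1"};
        for (String s : samples) {
            System.out.println(s
                + " noAllZero=" + NO_ALL_ZERO.matcher(s).matches()
                + " noLeadingZero=" + NO_LEADING_ZERO.matcher(s).matches());
        }
    }
}
```

The practical difference shows up on a value such as 01BRa_01_1_1: the lookahead variant accepts it because 01 is not all zeros, while the [1-9] variant rejects it.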

Related

How to properly lex negative numbers?

Following this example for implementing a simple lexer, I found that it doesn't properly resolve the operators.
E.g. if you give it the string 1 - 2 it works, but 1-2 does not.
The second example gives two tokens, 1 and -2, but it should recognize the minus sign as a separate token.
It fails because the regex NUMBER("-?[0-9]+") succeeds first.
If I switch the order of the regexes, then it fails on 1+-2 (4 tokens instead of 3).
Can this problem be solved with this "just-a-list-of-regexes" approach somehow?
Or do we always need to look ahead and resolve it manually? What would that look like?
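One possible way to resolve the sign manually is sketched below. This is an assumption on my part, not the linked example's code: NUMBER carries no sign, and a minus is folded into the following number only when the previous token was not a number. The class SignLexer, the TOKEN pattern, and the limited operator set are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignLexer {
    // Group 1: unsigned number. Group 2: a single operator character.
    private static final Pattern TOKEN =
        Pattern.compile("\\s*(?:([0-9]+)|([-+*/]))\\s*");

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        boolean prevIsNumber = false; // false at the start and after an operator
        boolean pendingMinus = false; // a minus waiting to be attached to a number
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected input at " + pos);
            }
            if (m.group(1) != null) {
                tokens.add(pendingMinus ? "-" + m.group(1) : m.group(1));
                pendingMinus = false;
                prevIsNumber = true;
            } else if (m.group(2).equals("-") && !prevIsNumber && !pendingMinus) {
                pendingMinus = true; // treat as a sign, not as an operator
            } else {
                if (pendingMinus) { // a sign with no number after it stays an operator
                    tokens.add("-");
                    pendingMinus = false;
                }
                tokens.add(m.group(2));
                prevIsNumber = false;
            }
            pos = m.end();
        }
        if (pendingMinus) tokens.add("-");
        return tokens;
    }
}
```

With this rule both 1-2 and 1 - 2 produce three tokens, and 1+-2 produces 1, +, -2. A common alternative is to keep the lexer free of signs altogether and handle unary minus in the parser.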

Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.
The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)
The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!
In my code, I have tried a couple of ways to match regex:
if ( (regex findFirstIn log).nonEmpty ) { do something }
val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
if (regex.pattern.matcher(log).matches()){do something}
All three suffer from poor performance when the regex mentioned above is added to the list of regexes. Any suggestions to improve regex performance or change the regex itself?
The Q/A that's marked as a duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, RegexBuddy, were free or at least worked on Mac.
I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.

The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group:
(?>(.+?)#011)
Once it has matched, there's no more backtracking into it. Done, and on to the next group.
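Both suggestions can be sketched in Java roughly as follows. The class AtomicFields, the helper names, and the choice to end the pattern with a greedy (.+) for the last field are my assumptions; the question's real pattern has 35 fields, so the field count is a parameter here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AtomicFields {
    // Plain split on the literal delimiter; -1 keeps trailing empty fields.
    static String[] splitFields(String line) {
        return line.split("#011", -1);
    }

    // Builds (?>(.+?)#011) repeated fieldCount-1 times, then (.+) for the last field.
    static Pattern build(int fieldCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < fieldCount; i++) {
            sb.append("(?>(.+?)#011)");
        }
        sb.append("(.+)");
        return Pattern.compile(sb.toString());
    }

    // Returns the captured fields, or null if the whole line does not match.
    static String[] extract(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[m.groupCount()];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1);
        }
        return out;
    }
}
```

Because each atomic group is committed once it has found its #011, a line with too few delimiters fails after a single left-to-right pass instead of retrying every way of distributing the text among the lazy groups.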
Negating a string
The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:
((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).
An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:
((?: (?!#011). )+)#011
It's all pretty complicated and most probably less performant than simply using the atomic group.
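For reference, here is how the two negation patterns might look in Java. The blanks have to be removed by hand: Pattern.COMMENTS would ignore the whitespace, but in that mode # starts a comment, which would cut these patterns short. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedDelimiter {
    // Complement written out as an alternation.
    static final Pattern ALTERNATION =
        Pattern.compile("((?:[^#]|#[^0]|#0[^1]|#01[^1])+)#011");

    // One character at a time, each guarded by a negative lookahead.
    static final Pattern LOOKAHEAD =
        Pattern.compile("((?:(?!#011).)+)#011");

    // Returns the first field if the line starts with "field#011", else null.
    static String firstField(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.lookingAt() ? m.group(1) : null;
    }
}
```

The two are not fully equivalent. On an input such as x##011y the alternation consumes ## as a single unit and then cannot find the delimiter, whereas the lookahead version yields x# as the field.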
Optimizations
Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's room for improvement (a factor of 1.8):
( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.
A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.
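A Java sketch of the optimized pattern with the blanks removed is shown below. The second pattern, which puts a negative lookahead in front to reject an empty field, is only one possible workaround and is my assumption rather than part of the suggestion above; the names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveField {
    // Unrolled loop with possessive quantifiers; also accepts an empty field.
    static final Pattern OPTIMIZED =
        Pattern.compile("([^#]*+(?:#(?!011)[^#]*)*+)#011");

    // Assumed workaround: refuse to start when the delimiter comes first.
    static final Pattern NON_EMPTY =
        Pattern.compile("(?!#011)([^#]*+(?:#(?!011)[^#]*)*+)#011");

    // Returns the first field if the line starts with "field#011", else null.
    static String firstField(Pattern p, String line) {
        Matcher m = p.matcher(line);
        return m.lookingAt() ? m.group(1) : null;
    }
}
```

If the field is followed by #011 at all, the guarded version only differs in the case where the field would be empty.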

Regex: How not to match a few letters

I have the following string: SEE ATTACHED ADDENDUM TO HUD-1194,520.07
Inside that string is HUD-1 and after that is 194,520.07. What I want is the 194,520.07 part.
I have written the following regular expression to pull that value out:
[^D\-1](?:-|\()?\$?(?:\d{1,3}[ ,]?)*(?:\.\d+)\)?
However, this pulls out: 94,520.07
I know it has something to do with this part: [^D\-1], which is "eating" too many of the 1's. Any ideas how I can stop it from "eating" 1's after the first one that appears in HUD-1?
UPDATED:
The reason for all the other parts is that I only want a match if the value after HUD-1 is a money amount. The rest of the regex tries to cover the different ways a money amount could be written.

Why not something as simple as:
.*HUD\-1(.*+)
OK, based on your updated question I see that you need to be more restrictive. Try changing [^D\-1] to just (?:HUD\-1)?. For what it's worth, your currency regex is very lax, allowing input like:
001 001 .31412341234123
You might consider not reinventing the wheel there, I'm sure you can find a currency RegEx quickly via Google. Otherwise, I'd also suggest anchoring your RegEx with a $ at the end of it.

This change makes the second capture group of the regex contain the full number you want (everything after the first 1) and puts the HUD-1 prefix, if present, in a separate capture group.
(HUD-1)?((?:-|\()?\$?(?:\d{1,3}[ ,]?)*(?:\.\d+)\)?)
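In Java, the two-group pattern could be used roughly like this. The class MoneyAfterHud and the method amount are hypothetical names, and the sketch assumes Matcher.find() on the sample string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MoneyAfterHud {
    // Group 1: optional HUD-1 prefix. Group 2: the money amount.
    private static final Pattern AMOUNT = Pattern.compile(
        "(HUD-1)?((?:-|\\()?\\$?(?:\\d{1,3}[ ,]?)*(?:\\.\\d+)\\)?)");

    // Returns the first amount found in the text, or null if there is none.
    static String amount(String text) {
        Matcher m = AMOUNT.matcher(text);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(amount("SEE ATTACHED ADDENDUM TO HUD-1194,520.07"));
    }
}
```

Since the prefix is consumed by its own group, the leading 1 of the amount is no longer swallowed by a character class.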

Regular Expression for IP validation which works in JFLAP

I noticed that regular expressions which we programmers use in our programs for tasks such as
email address validation
IP validation
...
are a bit different from those Regular Expressions which are used in Automata (if I'm not mistaken)
By the way I want to design an NFA and eventually a DFA for IP validation.
I have found a lot of regular expressions such as the following one:
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
But I can not convert it to an NFA or DFA using JFLAP.
What should I do?

You don't need to directly convert the regex, you can rewrite it once you understand what it's trying to do.
A valid IPv4 address is four numbers separated by dots. Each number can be from 0 to 255. Regex doesn't handle numeric ranges well, which is why the expression looks the way it does. Each group in the regex you posted spells out the allowed cases: 25 followed by 0-5, 2 followed by 0-4 and any digit, or an optional 0 or 1 followed by one or two digits.
The easiest way to validate an address is to split it with the . as the delimiter, convert the strings to numbers, and check their range.
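A minimal Java sketch of that split-and-check approach follows. The class name Ipv4Check is made up, and accepting leading zeros such as 001 is an assumption chosen to mirror what the posted regex accepts.

```java
class Ipv4Check {
    static boolean isValid(String address) {
        // -1 keeps trailing empty strings, so "1.2.3." yields four parts.
        String[] parts = address.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty() || part.length() > 3) {
                return false;
            }
            for (int i = 0; i < part.length(); i++) {
                char c = part.charAt(i);
                if (c < '0' || c > '9') {
                    return false; // only ASCII digits are allowed
                }
            }
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("192.168.0.1"));
        System.out.println(isValid("256.1.1.1"));
    }
}
```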
That said, there is nothing non-standard in the regex you posted. It's as simple as they come, I don't know why it doesn't work as-is for you.

Special Regular Expression syntax in Java

I am using a regular expression for image file names.
The main reason why I'm using RegEx's is to prevent multiple files for the exact same purpose.
The syntax for the filenames can either be:
1) img_0F_16_-32_0.png
2) img_65_32_x.png
As you might have noticed, "img_" is the general prefix.
What follows is a two-digit hexadecimal number.
After another underscore comes an integer that has to be a power of two between 1 and 512. Yet another underscore is next.
Okay so this far, my regular expression is working flawlessly.
The rest is what I'm having problems with:
What can follow is either a pair of integer coordinates (which can be 0) separated by an underscore, or an x. After this comes the final ".png". Done.
Now the main problem I am having is that both variants have to be possible,
and also it is highly important that there may not be any duplicate coordinates.
Most importantly, integers, both positive and negative, may never start with one or more zeros!
This would produce duplications like:
401 = 00401
-10 = -0010
This is my first attempt:
img_[0-9a-fA-F]{2}_(1|2|4|8|16|32|64|128|256|512)_([-]?[1-9])?[0-9]*_([-]?[1-9])?[0-9]*[.]png
Thanks for your help in advance,
Tom S.

Why use regular expressions? Why not create a class that decomposes either variant of the String into a canonical String, give the class hashCode() and equals() methods that use this canonical String, and then create a HashSet of these objects to make sure that only one file of each kind exists?
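A rough sketch of such a class is below. The name ImageName, the parse method, and the exact canonical form (upper-case hex, numbers without leading zeros) are my assumptions about how the idea could be carried out.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ImageName {
    private final String canonical;

    private ImageName(String canonical) {
        this.canonical = canonical;
    }

    // Returns the canonical form of either variant, or null if the name is not recognised.
    static ImageName parse(String fileName) {
        if (!fileName.startsWith("img_") || !fileName.endsWith(".png")) return null;
        String[] p = fileName.substring(4, fileName.length() - 4).split("_", -1);
        if (p.length < 3 || p.length > 4 || !p[0].matches("[0-9a-fA-F]{2}")) return null;
        try {
            int size = Integer.parseInt(p[1]);
            // bitCount == 1 means exactly one bit is set, i.e. a power of two.
            if (size < 1 || size > 512 || Integer.bitCount(size) != 1) return null;
            String head = "img_" + p[0].toUpperCase(Locale.ROOT) + "_" + size + "_";
            if (p.length == 3) {
                return p[2].equals("x") ? new ImageName(head + "x.png") : null;
            }
            // parseInt drops leading zeros, so 00401 and 401 collapse to one value.
            return new ImageName(head + Integer.parseInt(p[2]) + "_" + Integer.parseInt(p[3]) + ".png");
        } catch (NumberFormatException e) {
            return null;
        }
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ImageName && ((ImageName) o).canonical.equals(canonical);
    }

    @Override
    public int hashCode() {
        return canonical.hashCode();
    }

    @Override
    public String toString() {
        return canonical;
    }

    public static void main(String[] args) {
        Set<ImageName> seen = new HashSet<>();
        System.out.println(seen.add(parse("img_0F_16_-32_0.png")));
        System.out.println(seen.add(parse("img_0f_016_-0032_00.png")));
    }
}
```

Here duplicates are detected by normalising the name instead of being forbidden by the pattern, so a name with leading zeros is treated as the same file rather than as an invalid one.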
