How to properly lex negative numbers? - java

Following this example for implementing a simple lexer, I found that it doesn't properly resolve the operators.
E.g. if you give it a string 1 - 2 it works, but 1-2 does not.
The second example gives two tokens, 1 and -2, but it should recognize the minus sign as an operator.
It fails because the regex NUMBER("-?[0-9]+") succeeds first.
If I switch the regexes, then it fails on 1+-2 (4 tokens instead of 3).
Can this problem be solved with this "just-a-list-of-regexes" approach somehow?
Or do we always need to look ahead and resolve it manually? What would that look like?
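One way to keep the list-of-regexes approach is to let the lexer remember whether the previous token was a number, and to try the operator regex first in that case. The sketch below is only an assumption about how that could look, not the code from the linked example; the class name SignAwareLexer, the operator set and the use of plain strings as tokens are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SignAwareLexer {
    private static final Pattern SPACE = Pattern.compile("\\s+");
    private static final Pattern NUMBER = Pattern.compile("-?[0-9]+");
    private static final Pattern OPERATOR = Pattern.compile("[-+*/]");

    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        boolean prevWasNumber = false; // was the last emitted token a number?
        int pos = 0;
        while (pos < input.length()) {
            // Skip whitespace between tokens.
            Matcher space = SPACE.matcher(input).region(pos, input.length());
            if (space.lookingAt()) {
                pos = space.end();
                continue;
            }
            Matcher number = NUMBER.matcher(input).region(pos, input.length());
            Matcher operator = OPERATOR.matcher(input).region(pos, input.length());
            if (prevWasNumber && operator.lookingAt()) {
                // Right after a number, '-' can only be a binary operator.
                tokens.add(operator.group());
                pos = operator.end();
                prevWasNumber = false;
            } else if (number.lookingAt()) {
                // At the start or after an operator, '-' may be a sign.
                tokens.add(number.group());
                pos = number.end();
                prevWasNumber = true;
            } else if (operator.lookingAt()) {
                tokens.add(operator.group());
                pos = operator.end();
                prevWasNumber = false;
            } else {
                throw new IllegalArgumentException("Unexpected character at index " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("1-2"));  // prints [1, -, 2]
        System.out.println(lex("1+-2")); // prints [1, +, -2]
    }
}
```

A grammar with parentheses would presumably need to treat a closing parenthesis like a number here. Another common design is to always lex - as an operator and let the parser handle unary minus.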

Related

Regex to match a fixed sub string in a String

I am trying to write a regular expression to verify the presence of a specific number in a fixed position in a String.
String: 109300300330066611111111100000000017000656052086116020170111Name 1
Number to find: 111111111 (Starting from position 17)
I have written the following regular expression:
^.{16}(?<Ones>111111111)(.*)
My understanding is:
Let first 16 characters be whatever they are
Use the Named Capturing Group to grab the specific word
Let the rest of the characters be whatever they are
I am new to regex, is there any issue with the above approach?
Can it be done in other/better way?
I am using Java 8.
Without more details of why you're doing what you're doing, there's just one possible improvement I can see. You used .{16} at the beginning of the string rather than writing out 16 dots, which is nice and readable, so it would be good to do the same for the repeated 1s:
^.{16}(?<Ones>1{9})(.*)
Otherwise, the string of 1s is hard to understand without the coder manually counting how many there are in the regex.
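For illustration, here is a minimal sketch of how the pattern with the quantified group might be used in Java 8 to pull out the named group. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedPositionMatch {
    // 16 arbitrary characters, then nine 1s captured as "Ones", then the rest.
    private static final Pattern PATTERN = Pattern.compile("^.{16}(?<Ones>1{9})(.*)");

    // Returns the captured run of 1s, or null if the input does not match.
    static String findOnes(String s) {
        Matcher m = PATTERN.matcher(s);
        return m.find() ? m.group("Ones") : null;
    }

    public static void main(String[] args) {
        String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
        System.out.println(findOnes(s)); // prints 111111111
    }
}
```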
If you want to hard-code the ones, you know the starting position, and you just want to know if it is there, using a regex seems unnecessary. You can use this:
String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
if (s.indexOf("111111111", 16) == 16) doSomething();
Another possible solution without regex:
if (s.substring(16, 25).equals("111111111")) doSomething();
Otherwise your regex looks good.
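As a further non-regex variant, String.startsWith with an offset checks a fixed position directly. This is a sketch with a hypothetical class and method name; unlike substring, it returns false instead of throwing when the input is shorter than expected.

```java
class FixedPositionCheck {
    // Offset 16 is the 17th character of the string.
    static boolean hasOnesAt17(String s) {
        return s.startsWith("111111111", 16);
    }

    public static void main(String[] args) {
        String s = "109300300330066611111111100000000017000656052086116020170111Name 1";
        System.out.println(hasOnesAt17(s));       // prints true
        System.out.println(hasOnesAt17("short")); // prints false
    }
}
```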

Regular Expression to Match Number of Lines and Characters per Line

I'm trying to make sure that a string contains between 0 and 3 lines, and that each line that is present contains 0 to 100 characters. It would need to be a valid expression for JavaScript and Java. Like many people doing RegEx I'm copying from various spots on the Internet.
Working backwards I think ^.{0,100}$ gets me the "line contains 0 to 100 characters", but trying to group that as (^.{0,100}$){0,3} doesn't work.
The new line character is probably part of my problem, so I ended up with something like .{0,100}(?:\n.{0,100}){0,2} trying to say "a line of 0 to 100 characters optionally followed by 0 to 2 instances of a new line and 0 to 100 more characters", but that also failed.
Up until now I got those expressions from other people. Using an online test tool I finally monkeyed this together: ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ which appears to work.
So, my question is, am I missing any pitfalls in ^.{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}$ given what I'm after? Furthermore, even if that does work is it the best expression to use?
I think what you have will work fine. You can make the line break part a little more compact if you want, and you don't need ^ and $ if you are using matches():
String regex = ".{0,100}(?:[\r\n]+.{0,100}){0,2}";
EDIT
After some more thought I realized the newline suggestion above will match 4 (or more) lines as long as a couple of them are empty. So, we are back to your suggested example. Oh well, at least the start and end anchors can be omitted.
String regex = ".{0,100}(?:(?:\r\n|[\r\n]).{0,100}){0,2}";
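The difference between the two variants can be checked with a small sketch like the one below. The class and method names are hypothetical, and the regexes are written with escaped \\r and \\n, which Java's Pattern treats the same as the literal characters.

```java
import java.util.regex.Pattern;

class LineLimitCheck {
    // Each line break is matched exactly once, so at most 3 lines pass.
    private static final Pattern STRICT =
            Pattern.compile(".{0,100}(?:(?:\\r\\n|[\\r\\n]).{0,100}){0,2}");
    // Compact variant: [\r\n]+ swallows runs of breaks, so empty lines slip through.
    private static final Pattern LOOSE =
            Pattern.compile(".{0,100}(?:[\\r\\n]+.{0,100}){0,2}");

    // matches() requires the whole input to match, so ^ and $ are not needed.
    static boolean strict(String s) { return STRICT.matcher(s).matches(); }
    static boolean loose(String s) { return LOOSE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(strict("a\nb\nc"));    // three lines: true
        System.out.println(strict("a\nb\nc\nd")); // four lines: false
        System.out.println(strict("a\n\n\nd"));   // four lines, two empty: false
        System.out.println(loose("a\n\n\nd"));    // four lines, two empty: true
    }
}
```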
I'm not very good at regular expressions but would this work?
^.{0,100}\n?(.{0,100}\n)?.{0,100}?$
Again, I'm still new to regex, so if there is an error (which is likely) please tell me.

Regex exceptions to introduce in array of values

I have created a regular expression like this:
^[0-9][0-9][A-Z][A-Z][a-z]_([0-9]{1,10})_([0-9]{1,11})_([0-9]{1,11})$
It should match values ranging from 01BRa_1_1_1 to 99BRz_9999999999_99999999999_99999999999
My problem is that I need to exclude 0 values in the _number_number_number part, so that each number starts from 1.
Have been trying different expressions but can't find right one.
If someone knows how to solve this, help would be appreciated. Thanks.
The goal is to eliminate 0_0_0, 00_00_00, 000_000_000 and all cases where 0 is the first digit, so the first valid combination would be 1_1_1 for those 3 fields.
I am using this in Java (to reply to one comment), but I do not see the relevance of that; this is more or less just a Pattern.
Resolved with this:
^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$
If your goal is to eliminate values equal to 0 (0, 00, 000, etc) then an expression like this might work:
^[0-9][0-9][A-Z][A-Z][a-z]_(?!0+_)([0-9]{1,10})_(?!0+_)([0-9]{1,11})_(?!0+$)([0-9]{1,11})$
Of course, this will depend on your regex engine supporting variable-length zero-width assertions (aka "lookahead"). It would help to know which flavor you are using. (From the regex tooltip: "Please also include a tag specifying the programming language or tool you are using.")
If your goal is to eliminate anything starting with 0, (0, 01, 001, etc), then an expression like this might work:
^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$
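To see how the two readings of the requirement differ, the sketch below runs the lookahead pattern and the leading-digit pattern (with the field lengths from the question's own resolution) against a few inputs. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class FieldZeroCheck {
    // Rejects fields that consist only of zeros (0, 00, 000, ...).
    private static final Pattern NO_ALL_ZERO = Pattern.compile(
            "^[0-9][0-9][A-Z][A-Z][a-z]_(?!0+_)([0-9]{1,10})_(?!0+_)([0-9]{1,11})_(?!0+$)([0-9]{1,11})$");
    // Rejects any field with a leading zero (0, 01, 001, ...).
    private static final Pattern NO_LEADING_ZERO = Pattern.compile(
            "^[0-9][0-9][A-Z][A-Z][a-z]_([1-9][0-9]{0,9})_([1-9][0-9]{0,10})_([1-9][0-9]{0,10})$");

    static boolean noAllZero(String s) { return NO_ALL_ZERO.matcher(s).matches(); }
    static boolean noLeadingZero(String s) { return NO_LEADING_ZERO.matcher(s).matches(); }

    public static void main(String[] args) {
        // 01 is not all zeros, but it does start with a zero.
        System.out.println(noAllZero("01BRa_01_1_1"));     // prints true
        System.out.println(noLeadingZero("01BRa_01_1_1")); // prints false
        // 00 is rejected by both patterns.
        System.out.println(noAllZero("01BRa_00_1_1"));     // prints false
        System.out.println(noLeadingZero("01BRa_00_1_1")); // prints false
    }
}
```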

Regular Expression for IP validation which works in JFLAP

I noticed that regular expressions which we programmers use in our programs for tasks such as
email address validation
IP validation
...
are a bit different from those Regular Expressions which are used in Automata (if I'm not mistaken)
By the way I want to design an NFA and eventually a DFA for IP validation.
I have found a lot of regular expressions such as the following one:
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
But I can not convert it to an NFA or DFA using JFLAP.
What should I do?
You don't need to directly convert the regex, you can rewrite it once you understand what it's trying to do.
A valid IPv4 address is 4 numbers separated by dots. Each number can be from 0 to 255. Regex doesn't express numeric ranges well, which is why the pattern looks the way it does. Each alternative covers part of the range: 25[0-5] matches 250 to 255, 2[0-4][0-9] matches 200 to 249, and [01]?[0-9][0-9]? matches 0 to 199 (leading zeros allowed).
The easiest way to validate an IP address is to split it with the . as the delimiter, convert the strings to numbers, and check their range.
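The split-and-check approach could be sketched in Java roughly as follows. The class and method names are hypothetical, and like the regex above this version accepts leading zeros such as 01.

```java
class Ipv4Check {
    static boolean isValidIpv4(String s) {
        // Limit -1 keeps trailing empty strings, so "1.2.3.4." yields 5 parts.
        String[] parts = s.split("\\.", -1);
        if (parts.length != 4) {
            return false;
        }
        for (String part : parts) {
            // Each part must be 1 to 3 ASCII digits.
            if (part.isEmpty() || part.length() > 3) {
                return false;
            }
            for (char c : part.toCharArray()) {
                if (c < '0' || c > '9') {
                    return false;
                }
            }
            if (Integer.parseInt(part) > 255) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidIpv4("192.168.0.1")); // prints true
        System.out.println(isValidIpv4("256.1.1.1"));   // prints false
    }
}
```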
That said, there is nothing non-standard in the regex you posted. It's as simple as they come; I don't know why it doesn't work as-is for you.

Java Regex Engine Crashing

Regex Pattern - ([^=](\\s*[\\w-.]*)*$)
Test String - paginationInput.entriesPerPage=5
Java Regex Engine Crashing / Taking Ages (> 2mins) finding a match. This is not the case for the following test inputs:
paginationInput=5
paginationInput.entries=5
My requirement is to get hold of the String on the right-hand side of = and replace it with something. The above pattern is doing it fine except for the input mentioned above.
I want to understand why the error occurs and how I can optimize the regex for my requirement so as to avoid other peculiar cases.
You can use a look behind to make sure your string starts at the character after the =:
(?<=\\=)([\\s\\w\\-.]*)$
As for why it is crashing, it's the second * around the group. I'm not sure why you need that, since that sounds like you are asking for:
A single character, anything but equals
Then 0 or more repeats of the following group:
Any amount of white space
Then any amount of word characters, dash, or dot
End of string
Anyway, take out that *, and it doesn't spin forever anymore, but I'd still go for the more specific regex using the look behind.
Also, I don't know how you are using this, but why did you have the $ in there? Then you can only match the last one in the string (if you have more than one). It seems like you'd be better off with a look-ahead to the new line or the end: (?=\\n|$)
[Edit]: Update per comment below.
Try this:
=\\s*(.*)$
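As a rough sketch of the replacement itself, the lookbehind pattern suggested earlier in this answer can be combined with replaceFirst. The class and method names are hypothetical, and the replacement value is just an example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceValue {
    // Matches only the text after '=', up to the end of the input.
    private static final Pattern AFTER_EQUALS = Pattern.compile("(?<==)[\\s\\w\\-.]*$");

    static String replaceValue(String line, String newValue) {
        // quoteReplacement keeps '$' and '\' in newValue from being read as group references.
        return AFTER_EQUALS.matcher(line).replaceFirst(Matcher.quoteReplacement(newValue));
    }

    public static void main(String[] args) {
        // prints paginationInput.entriesPerPage=10
        System.out.println(replaceValue("paginationInput.entriesPerPage=5", "10"));
    }
}
```

Because the character class is repeated only once, with no nested quantifier, this pattern does not have the backtracking problem of the original.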
