Differences in regex syntax in Python and Java - java

I have the following working regex in Python and I am trying to convert it to Java. I thought that regexes worked the same in both languages, but evidently they don't.
Python regex: ^\d+;\d+-\d+
My Java attempt: ^\\d+;\\d+-\\d+
Example strings that should be matched:
3;1-2,2-3
68;12-15,1-16,66-1,1-2
What is the right solution in Java?
Thank you, Tomas

The regex is faulty for this input. I don't know how you were calling it in Python, but it doesn't match the whole string in any regex flavor I know.
This should do the trick (Java string escaping is omitted):
^\d+;(\d+-\d+,?)+
I.e. you need to continue matching the number pairs separated by commas.
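A likely explanation for the Python/Java discrepancy, assuming the Python side used re.match (which only anchors at the start) and the Java side used matches() (which requires the whole string to match), can be sketched as follows. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PairListDemo {
    // Original pattern: covers only the count and the first number pair.
    static final Pattern PREFIX = Pattern.compile("^\\d+;\\d+-\\d+");
    // Extended pattern: keeps consuming comma-separated pairs.
    static final Pattern FULL = Pattern.compile("^\\d+;(\\d+-\\d+,?)+");

    // True only if the entire string matches the pattern.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    // True if a match starts at index 0; the rest of the string is ignored.
    static boolean matchesPrefix(Pattern p, String s) {
        return p.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3;1-2,2-3", "68;12-15,1-16,66-1,1-2"}) {
            System.out.println(s
                    + " original/whole=" + matchesWhole(PREFIX, s)
                    + " original/prefix=" + matchesPrefix(PREFIX, s)
                    + " extended/whole=" + matchesWhole(FULL, s));
        }
    }
}
```

If a prefix match is all you need, lookingAt() or find() with the original pattern may already be enough; the extended pattern is only required when the whole string has to match.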

Related

Match Lua multiline strings and comments with Regex

I have a Lua editor in which I implemented syntax highlighting. I use regexes to match expressions like strings, comments, tokens, numbers, etc of Lua. The whole thing is made in Java and uses Java regexes. I had trouble with two things:
Multiline strings - Lua multiline strings start with [[ and end with ]]. Everything between is the string, and there can even be nested multiline strings. The regex I made is \[\[((?>[^\[\[\]\]]|(?R))*\]\]) and it works. It's similar to the approach usually described for matching balanced constructs: it finds expressions with equal numbers of [[ and ]]. The problem is that recursion is not supported by the Java regex engine. How can I replace it with something supported?
Multiline comments - Lua multiline comments start with --[====[ and end with ]====]. A comment ends only if the closing bracket has as many equal signs as the opening bracket. There can be any number of equal signs, including zero. I made this regex --\[\[((.|\n)*?)\]\] but it only works for the --[[ comment ]] pattern and does not support --[==[ comment ]==]. Maybe I could count the equal signs at the opening and then match the same number for the closing tag. Is this possible in Java regex? How?
Try this
--\[(=*)\[(.|\n)*?\]\1\]
Multiline string literals are absolutely the same but without leading --:
\[((=*)\[(.|\n)*?)\]\2\]
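A minimal Java sketch of the backreference idea for comments. It swaps (.|\n)*? for .*? under the (?s) flag, which is my substitution rather than part of the answer above; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LuaLongCommentDemo {
    // (=*) captures the run of '=' in the opening bracket;
    // \1 requires the same run in the closing bracket.
    // (?s) lets '.' match newlines so the comment can span lines.
    static final Pattern LONG_COMMENT =
            Pattern.compile("(?s)--\\[(=*)\\[.*?\\]\\1\\]");

    // Returns every long comment found in the source text, in order.
    static List<String> findComments(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = LONG_COMMENT.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "x = 1 --[==[ a ]] b\n ]==] y = 2 --[[ c ]]";
        for (String comment : findComments(source)) {
            System.out.println("comment: " + comment);
        }
    }
}
```

The lazy .*? stops at the first closing bracket whose equal-sign count matches the opener, so a stray ]] inside a --[==[ comment does not end it.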

Regular Expressions match randomly instead of around quotes in Java

I am writing a program in Java that uses regular expressions, and I have run into an error. I am basically making a programming language and parsing it line by line. It goes wrong when it tries to find strings. I have to match in the order identifiers, strings, then integers, but I can't have the identifier rule find strings. Strings are defined by having double quotes around them. Here is my expression:
[^"]([^\W][a-zA-Z0-9]+)[^"]
I cannot show my Java code, because it is all over the place, with the way I programmed it. It should just be the expression, and that's it.
It would be helpful if you can explain more what exactly you are trying to match. E.g. give some example texts and what your expression currently outputs for them.
At the moment I think you are trying to match Strings, text that is surrounded by ". For example foofoo"text123"barbar and your desired output is text123.
When defining a regular expression in a Java string literal, you need to escape characters such as " and \. Here is a Java-usable version of the regex you provided:
Pattern pattern = Pattern.compile("[^\"]([^\\W][a-zA-Z0-9]+)[^\"]");
You may then use the Pattern object together with a Matcher object to find your text; see the Javadoc for Pattern.
Here is a Pattern that matches text surrounded by ":
Pattern pattern = Pattern.compile("\"[^\"]*\"");
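To show how the Pattern and Matcher fit together, here is a small sketch that pulls out the text between double quotes. The capturing group around [^"]* is my addition so the quotes themselves can be left out of the result; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedStringDemo {
    // A quote, then any run of non-quote characters (captured), then a quote.
    static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the contents of every double-quoted string in the line.
    static List<String> quotedContents(String line) {
        List<String> contents = new ArrayList<>();
        Matcher m = QUOTED.matcher(line);
        while (m.find()) {
            contents.add(m.group(1)); // group 1 excludes the surrounding quotes
        }
        return contents;
    }

    public static void main(String[] args) {
        System.out.println(quotedContents("foofoo\"text123\"barbar"));
    }
}
```

This sketch does not handle escaped quotes inside a string; whether that matters depends on the language being parsed.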

How to pattern match [ and ] in Java?

The string is something in the format:
[anything anything]
with a space separating the two 'anything's.
I've tried:
(string).replaceAll("(^[)|(]$)","");
(string).replaceAll("(^\[)|(\]$)","");
but the latter gives me a compilation error and the first doesn't do anything. I implemented my current solution based on:
Java Regex to remove start/end single quotes but leave inside quotes
Looking around SO turns up many questions with problems similar to mine, but implementing their solutions does not work (they either do nothing or yield compilation errors):
regex - match brackets but exclude them from results
Regular Expressions on Punctuation
What am I doing wrong?
Since both Java and regex treat the \ character as an escape character, you have to double it when writing the regex in a Java string literal.
So the regular expression:
(^\[)|(\]$)
in a Java string actually should be:
"(^\\[)|(\\]$)"

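For the bracket-stripping question above, a minimal sketch of the doubled-backslash version in use. The class and method names are hypothetical.

```java
class BracketStripDemo {
    // The regex (^\[)|(\]$) written as a Java string literal:
    // every regex backslash is doubled.
    static String stripBrackets(String s) {
        return s.replaceAll("(^\\[)|(\\]$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripBrackets("[anything anything]"));
    }
}
```

Only a leading [ and a trailing ] are removed; brackets elsewhere in the string are left alone.
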
Differences in RegEx syntax between Python and Java

I have a working regex in Python and I am trying to convert to Java. It seems that there is a subtle difference in the implementations.
The RegEx is trying to match another reg ex. The RegEx in question is:
/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)
One of the strings that it is having problems on is: /\s+/;
The regex is not supposed to match the ending ;. In Python the RegEx works correctly (and does not match the ending ;), but in Java it does include the ;.
The Question(s):
What can I do to get this RegEx working in Java?
Based on what I read here there should be no difference for this RegEx. Is there somewhere a list of differences between the RegEx implementations in Python vs Java?
Java doesn't parse regular expressions in the same way as Python for a small set of cases. In this particular case the nested ['s were causing problems. In Python you don't need to escape a [ inside a character class, but in Java you do, because Java treats an unescaped [ there as the start of a nested character class.
The original RegEx (for Python):
/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)
The fixed RegEx (for Java and Python):
/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)
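As a rough check, the fixed regex can be written as a Java string literal (every regex backslash doubled) and run against the problem input. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLiteralDemo {
    // The fixed regex from above, with each backslash doubled for Java.
    static final Pattern REGEX_LITERAL = Pattern.compile(
            "/(\\\\.|[^\\[/\\\\\\n]|\\[(\\\\.|[^\\]\\\\\\n])*\\])+/([gim]+\\b|\\B)");

    // Returns the first regex-literal-looking match in the input, or null.
    static String firstMatch(String input) {
        Matcher m = REGEX_LITERAL.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The input is the six characters /\s+/; from the question.
        System.out.println(firstMatch("/\\s+/;"));
    }
}
```

With find(), group 0 stops at the closing / and leaves the trailing ; out.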
The obvious difference between Java and Python is that in Java you need to escape a lot of characters.
Moreover, you are probably running into a mismatch between the matching methods, not a difference in the actual regex notation:
Given the Java
String regex, input; // initialized to something
Matcher matcher = Pattern.compile( regex ).matcher( input );
Java's matcher.matches() (also Pattern.matches( regex, input )) matches the entire string. The closest Python equivalent is re.fullmatch( regex, input ) (Python 3.4 and later); on older versions the same result can be achieved by using re.match( regex, input ) with a regex that ends with $.
Java's matcher.find() and Python's re.search( regex, input ) match any part of the string.
Java's matcher.lookingAt() and Python's re.match( regex, input ) match the beginning of the string.
For more details also read Java's documentation of Matcher and compare to the Python documentation.
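The three Java matching methods can be compared side by side with a small sketch. The helper names are made up for illustration.

```java
import java.util.regex.Pattern;

class MatchModesDemo {
    // matches(): the regex must cover the entire input.
    static boolean whole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // lookingAt(): the regex must match starting at index 0.
    static boolean prefix(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // find(): the regex may match anywhere in the input.
    static boolean anywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "\\d+";
        for (String input : new String[] {"123", "123abc", "abc123"}) {
            System.out.println(input
                    + " matches=" + whole(regex, input)
                    + " lookingAt=" + prefix(regex, input)
                    + " find=" + anywhere(regex, input));
        }
    }
}
```

A regex ported from Python's re.match will usually behave the same with lookingAt(), and one ported from re.search with find().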
Since you said that isn't the problem, I decided to do a test: http://ideone.com/6w61T
It looks like Java is doing exactly what you need it to (group 0, the entire match, doesn't contain the ;). Your problem is elsewhere.

Simple regex required

I've never used regexes in my life and by jove it looks like a deep pool to dive into. Anyway,
I need a regex for this pattern (AN is alphanumeric (a-z or 0-9), N is numeric (0-9) and A is alphabetic (a-z)):
AN,AN,AN,AN,AN,N,N,N,N,N,N,AN,AN,AN,A,A
That's five AN's, followed by six N's, followed by three AN's, followed finally by two A's.
If it makes a difference, the language I'm using is Java.
[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}
should work in most RE dialects for the task as you specified it -- most of them will also support abbreviations such as \d (digit) in lieu of [0-9] (but if alphabetics need to be lowercase, as you appear to be requesting, you'll probably need to spell out the a-z parts).
Replace each AN by [a-z0-9], each N by [0-9], and each A by [a-z].
30 seconds in Expresso:
[a-zA-Z0-9]{5}[0-9]{6}[a-zA-Z0-9]{3}[a-zA-Z]{2}
Case insensitive, but you can probably define that in Java instead of the regex.
For the example you posted, the following should work fine.
(([A-Za-z\d])*,){5}+(([\d])*,){6}+(([A-Za-z\d])*,){3}+([\d])*,[\d]*
In Java you should be able to use it like this:
boolean foundMatch = subjectString.matches("(([A-Za-z\\d])*,){5}+(([\\d])*,){6}+(([A-Za-z\\d])*,){3}+([\\d])*,[\\d]*");
I used this tool to help in learning RegEx; it also makes this really easy:
http://www.regexbuddy.com/
Try looking at some simple Java regex tutorials.
They'll tell you how to form regular expressions and also how to use them in Java.
This should match the pattern you request.
[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}
In addition, you could add beginning-of-string / end-of-string anchors if the match should fail when any other characters are present:
^[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}$
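A minimal sketch of using this pattern from Java. Since matches() already requires the whole string to match, the ^ and $ anchors are optional here. The CASE_INSENSITIVE variant is shown only as an option in case uppercase input should be accepted; the class, field, and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CodeFormatDemo {
    static final String REGEX = "[a-z0-9]{5}[0-9]{6}[a-z0-9]{3}[a-z]{2}";

    // Lowercase-only version, as the question specifies.
    static final Pattern CODE = Pattern.compile(REGEX);
    // Optional variant that also accepts uppercase letters.
    static final Pattern CODE_ANY_CASE =
            Pattern.compile(REGEX, Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        return CODE.matcher(s).matches();
    }

    static boolean isValidIgnoringCase(String s) {
        return CODE_ANY_CASE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ab12c123456x9zqq"));
        System.out.println(isValid("ab12c123456x9z12"));
        System.out.println(isValidIgnoringCase("AB12C123456X9ZQQ"));
    }
}
```

If you use find() instead of matches(), keep the ^ and $ anchors so that extra characters before or after the code cause a failure.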
