How to make Regex lookahead to match one and two digit numbers? - java

Let's say for example that I have a string reading "1this12string". I would like to use String#split with regex using lookahead that will give me ["1this", "12string"].
My current statement is (?=\d), which works very well for single digit numbers. I am having trouble modifying this statement to include both 1 and 2 digit numbers.

Add a look behind so you don't split within numbers:
(?<!\d)(?=\d)
See live demo.
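A minimal sketch of how this pattern behaves with String#split. The class and method names are made up for illustration, and Java 8 or later is assumed (older versions handled a zero-width match at the start of the input differently):

```java
import java.util.Arrays;

class SplitBeforeNumbers {
    // Split before each run of digits: the lookbehind rejects positions that
    // sit inside a number, the lookahead requires a digit to follow.
    static String[] splitBeforeNumbers(String input) {
        return input.split("(?<!\\d)(?=\\d)");
    }

    public static void main(String[] args) {
        // prints [1this, 12string]
        System.out.println(Arrays.toString(splitBeforeNumbers("1this12string")));
    }
}
```

With only (?=\d), the same input would also be split between the 1 and the 2 of 12, which is the problem described in the question.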

If you really want to use Regex Lookahead, try this:
(\d{1,2}[^\d]*)(?=\d|\b)
Regex Demo
Note that this assumes every piece of the split starts with 1 or 2 digits. If that is not the case, please let us know so that we can refine it further.
Regex Logics
\d{1,2} to match 1 or 2 digits at the front
[^\d]* to match non-digit characters following the first 1 or 2 digit(s)
Enclose the above 2 segments in parentheses () to make them a capturing group for extracting the matched text.
(?=\d to fulfill your requirement to use Regex Lookahead
|\b to allow the matching text to be at the end of a text (just before a word boundary)
I think you can also achieve your task with a simpler regex, without relying on a relatively more sophisticated feature like lookahead. For example:
\d{1,2}[^\d]*
You can see in the Regex Demo that this works equally well for your sample input. Anyway, in case your requirement is anything more than this, please let us know to fine-tune it.
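Since this pattern describes the pieces themselves rather than the split positions, it would be used with Matcher#find instead of String#split. A rough sketch, with hypothetical class and method names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindNumberedChunks {
    // 1 or 2 digits followed by any number of non-digit characters.
    private static final Pattern CHUNK = Pattern.compile("\\d{1,2}[^\\d]*");

    // Collect every match of the pattern, in order.
    static List<String> chunks(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = CHUNK.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1this, 12string]
        System.out.println(chunks("1this12string"));
    }
}
```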

Use
String[] splits = string.split("(?<=\\D)(?=\\d)");
See regex proof
Explanation
(?<=    look behind to see if there is:
\D      non-digits (all but 0-9)
)       end of look-behind
(?=     look ahead to see if there is:
\d      digits (0-9)
)       end of look-ahead

Related

Regex capturing: get only result from second group

I have a following string:
'pp_3', 365]
What comes after pp_ may have different length. What comes after , and is before ] is what I'd like to capture (and only it). Its length varies but it is always a number.
I've come up with (?<=pp_).*,(.*)(?=]). It gives 3', 365 as a full match and in group 1 there is what I want '365'. How do I get only 365 as a full match?
Please let me know if I am unable to explain my doubts. Thanks
Try this:
[^_]*_(\d*)'\s*,\s*(\-?\d+)\s*].
This regex captures 2 groups, that correspond to each of the numbers, the first after pp_ and the second after ', (which may be negative). If you don't want to capture the first one as a group, instead of (\d*), just write (?:\d*).
Try this expression. The second group should be what you're after:
(?<='pp_)(\d*', )(\d*)]
To match only the digits using a positive lookbehind, you could put a bounded quantifier inside the lookbehind (you choose the upper bound yourself), which Java supports:
(?<=pp_[^,]{0,1000}, )\d+(?=])
Explanation
(?<= Positive lookbehind, assert what is on the left is
pp_[^,]{0,1000} Match pp_, match any char except , 0-1000 times
, Match a comma and space
) Close lookbehind
\d+ Match 1+ digits
(?=]) Positive lookahead, assert what is on the right is ]
In Java
String regex = "(?<=pp_[^,]{0,1000}, )\\d+(?=])";
Java demo
You could also use a capturing group instead:
pp_[^,]*, (\d+)]
Regex demo
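Both variants can be compared side by side. This is only a sketch: the class and method names are invented for illustration, and the sample input is the one from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SecondNumber {
    // Variant 1: bounded lookbehind, so the number itself is the full match.
    private static final Pattern LOOKBEHIND =
            Pattern.compile("(?<=pp_[^,]{0,1000}, )\\d+(?=])");
    // Variant 2: plain match with the number in capturing group 1.
    private static final Pattern GROUP =
            Pattern.compile("pp_[^,]*, (\\d+)]");

    static String viaLookbehind(String input) {
        Matcher m = LOOKBEHIND.matcher(input);
        return m.find() ? m.group() : null;   // full match
    }

    static String viaGroup(String input) {
        Matcher m = GROUP.matcher(input);
        return m.find() ? m.group(1) : null;  // first capturing group
    }

    public static void main(String[] args) {
        String input = "'pp_3', 365]";
        System.out.println(viaLookbehind(input)); // prints 365
        System.out.println(viaGroup(input));      // prints 365
    }
}
```

The capturing group version avoids the arbitrary upper bound of 1000 that the lookbehind needs.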

How to match a string in this way?

I need to check if a String matches this specific pattern.
The pattern is:
(Numbers)(all characters allowed)(numbers)
and the numbers may have a comma ("." or ",")!
For instance the input could be 500+400 or 400,021+213.443.
I tried Pattern.matches("[0-9],?.?+[0-9],?.?+", theequation2), but it didn't work!
I know that I have to use the method Pattern.matches(regex, String), but I am not able to find the correct regex.
Dealing with numbers can be difficult. This approach will deal with your examples, but check carefully. I also didn't do "all characters" in the middle grouping, as "all" would include numbers, so instead I assumed that finding the next non-number would be appropriate.
This Java regex handles the requirements:
"((-?)[\\d,.]+)([^\\d-]+)((-?)[\\d,.]+)"
However, there is a potential issue in the above. Consider the following:
300 - -200. The foregoing won't match that case.
Now, based upon the examples, I think the point is that one should have a valid operator. The number of math operations is likely limited, so I would whitelist the operators in the middle. Thus, something like:
"((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)"
Would, I think, be more appropriate. The [*/+-] can be expanded for the power operator ^ or whatever. Now, if one is going to start adding words (such as mod) in the equation, then the expression will need to be modified.
You can see this regular expression here
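As a quick check of the whitelisted-operator version, here is a small sketch. The class and method names are placeholders, not part of the original answer:

```java
import java.util.regex.Pattern;

class EquationCheck {
    // number, operator(s) surrounded by optional whitespace, number
    private static final Pattern EQUATION = Pattern.compile(
            "((-?)[\\d,.]+)([\\s]*[*/+-]+[\\s]*)((-?)[\\d,.]+)");

    // True only if the whole string has the shape "number operator number".
    static boolean isEquation(String s) {
        return EQUATION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEquation("500+400"));         // prints true
        System.out.println(isEquation("400,021+213.443")); // prints true
        System.out.println(isEquation("300 - -200"));      // prints true
        System.out.println(isEquation("500 400"));         // prints false
    }
}
```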
In your regex you have to escape the dot as \. to match it literally, and escape the plus as \+, or else ?+ would be read as a possessive quantifier. To match 1+ digits you have to use a quantifier: [0-9]+.
For your example data, you could match 1+ digits followed by an optional part consisting of a dot or a comma and 1+ more digits, once at the start and once at the end. To match any single character between the two numbers, you could use a dot.
Instead of using a dot, you could also use for example a character class [-+*] to list some operators or list what you would allow to match. If this should be the only match, you could use anchors to assert the start ^ and the end $ of the string.
\d+(?:[.,]\d+)?.\d+(?:[.,]\d+)?
In Java:
String regex = "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?";
Regex demo
That would match:
\d+(?:[.,]\d+)? 1+ digits followed by an optional part that matches . or , followed by 1+ digits
. Match any single character (use .+ to repeat 1+ times)
\d+(?:[.,]\d+)? Same as the first part
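A short sketch comparing the dot version with a character-class version. The names are made up, and the [-+*] class is just the example list of operators mentioned above:

```java
import java.util.regex.Pattern;

class NumberOperatorNumber {
    // Any single character between the two numbers.
    private static final Pattern ANY_CHAR = Pattern.compile(
            "\\d+(?:[.,]\\d+)?.\\d+(?:[.,]\\d+)?");
    // Only one of the listed operators between the two numbers.
    private static final Pattern OPERATOR = Pattern.compile(
            "\\d+(?:[.,]\\d+)?[-+*]\\d+(?:[.,]\\d+)?");

    static boolean matchesAnyChar(String s) {
        return ANY_CHAR.matcher(s).matches();
    }

    static boolean matchesOperator(String s) {
        return OPERATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAnyChar("400,021+213.443")); // prints true
        System.out.println(matchesAnyChar("1234"));            // prints true
        System.out.println(matchesOperator("1234"));           // prints false
    }
}
```

Because the dot also matches a digit, a plain number such as 1234 passes the first pattern, which is one reason to prefer listing the allowed operators.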

Regex to Split String on Even Number of Preceding Characters

I'm trying to develop a regex that will split a string on a single quote only if the single quote is preceded by zero question marks, or an even number of question marks. For example, the following string:
ABC??'DEF?'GHI'JKL????'MNO'
would result in:
ABC??
DEF?'GHI
JKL????
MNO
I've tried using this negative lookbehind:
(?<!\?\?)*\'
But that results in:
ABC??
DEF?
GHI
JKL????
MNO
I've also tried the following
(?<!(\?\?)*)\' results in runtime error
(?:\?\?)*\'
(?!\?\?)+\'
Any ideas would be greatly appreciated.
The split method isn't handy in this kind of situation. A workaround is to describe everything that isn't the delimiter and to use the find method:
[^?']+(?:\?.[^?']*)*|(?:\?.[^?']*)+
demo
pattern details:
[^?']* # zero or more characters that aren't a `?` or a `'`
(?: # open a non-capturing group
\? . # a question mark followed by a character (that can be a `?` or a `'`)
[^?']* # zero or more characters that aren't a `?` or a `'`
)* # close the non-capturing group and repeat it zero or more times
[^?']*(?:\?.[^?']*)* describes all that isn't the delimiter including the empty string. To avoid empty matches, I use 2 branches of an alternation: [^?']+(?:\?.[^?']*)* and (?:\?.[^?']*)+ to ensure there's at least one character.
(If you want to allow the empty string at the start of the string, add |^ at the end of the pattern)
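The find-based approach could look roughly like this in Java. The class and method names are hypothetical; the pattern is the one given above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteTokens {
    // Matches a run of text in which every '?' consumes the character
    // after it, so a quote that follows an odd number of '?' stays inside.
    private static final Pattern TOKEN = Pattern.compile(
            "[^?']+(?:\\?.[^?']*)*|(?:\\?.[^?']*)+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ABC??, DEF?'GHI, JKL????, MNO]
        System.out.println(tokens("ABC??'DEF?'GHI'JKL????'MNO'"));
    }
}
```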
You can also use the split method, but the pattern for it isn't efficient since it needs to look backward at each position (and it is limited, since lookbehind in Java allows only bounded quantifiers):
(?<=(?<!\?)(?:\?\?){0,100})'
or perhaps more efficient like this:
'(?<=(?<!\?)(?:\?\?){0,100}')
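For comparison, a sketch of the split variant using the first of the two patterns. It assumes no run of question marks is longer than 200 characters, because of the {0,100} bound; the class and method names are invented:

```java
import java.util.Arrays;

class QuoteSplit {
    // Split on a quote preceded by an even number (0 to 200) of '?'.
    // The inner (?<!\?) pins the start of the run of question marks.
    static String[] split(String input) {
        return input.split("(?<=(?<!\\?)(?:\\?\\?){0,100})'");
    }

    public static void main(String[] args) {
        // prints [ABC??, DEF?'GHI, JKL????, MNO]
        System.out.println(Arrays.toString(split("ABC??'DEF?'GHI'JKL????'MNO'")));
    }
}
```

String#split drops the trailing empty string produced by the final quote.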
Have you tried a positive lookbehind?
(?<=.')
Regex101
This regex will do it:
[A-Z]+(\?\?)*'
If you only need to handle a single question mark, not three, five, etc., you could use this:
(?<![^\?]\?)'
You could expand on this concept to match other specific odd numbers of question marks. For example, this will properly not split on a quote preceded by one, three, or five question marks:
(?<![^\?]\?|[^\?]\?{3}|[^\?]\?{5})'
Working example. Lookbehinds must be fixed-width, but some engines allow an OR of the entire lookbehind. Others do not, and would require it be written as three separate lookbehinds:
(?<![^\?]\?)(?<![^\?]\?{3})(?<![^\?]\?{5})'
Obviously this is getting a bit messy, though. And it can't handle an arbitrary odd number of ?.
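To illustrate both the behavior and the limitation, here is a small sketch using the three separate lookbehinds (hypothetical class and method names):

```java
import java.util.Arrays;

class OddQuestionMarkSplit {
    // Do not split on a quote preceded by exactly one, three, or five '?'.
    static String[] split(String input) {
        return input.split(
                "(?<![^\\?]\\?)(?<![^\\?]\\?{3})(?<![^\\?]\\?{5})'");
    }

    public static void main(String[] args) {
        // prints [ABC??, DEF?'GHI, JKL????, MNO]
        System.out.println(Arrays.toString(split("ABC??'DEF?'GHI'JKL????'MNO'")));
        // Seven question marks are not covered, so this quote is split on:
        // prints [A???????, B]
        System.out.println(Arrays.toString(split("A???????'B")));
    }
}
```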

word range or \w in negative lookbehind

I was trying to write a regex to extract the word in the position of Delhi from the text
sending to: GK Delhi, where sending to: is fixed and I don't want to capture whatever is in the position of GK. GK will always be one word in my case. What I came up with, and expected to work, is (?<=sending to: \w )Delhi, meaning: if the text starts with sending to: and ends with Delhi, return Delhi.
Please help me to fix this.
Three points,
\w matches a single word character. Use \w+ to match one or more or \w* to match zero or more word characters.
Don't forget about the space between GK and Delhi: \s+.
Just a note: The (?<= construct is the positive lookbehind, not negative one.
So the regex could look like this:
(?<=sending to:\s*\w+\s+)Delhi
Please also note that arbitrary-length lookbehind is only supported by very few regex engines, but you didn't say anything about the tool you are using.
Update:
Java doesn't support arbitrary-length lookbehind expressions.
The possibilities you have are:
The matched text will always be Delhi (on successful match). So if you are only checking for a match, then you could just use the regex: sending to:\s*\w+\s+Delhi.
If you want to extend the regex to other towns in future, then you could use a capturing group. The regex would be, for example, sending to:\s*\w+\s+(Delhi|Mumbai) and in Java code you would get the city name via matcher.group(1).
Please post the actual Java code showing how you are using the regex if you want more detailed advice.
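The capturing group option might look like this. It is a sketch only: the class and method names are made up, and Mumbai is just the example second city from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CityExtractor {
    // "sending to:", one word to skip, then the city in group 1.
    private static final Pattern SENDING_TO = Pattern.compile(
            "sending to:\\s*\\w+\\s+(Delhi|Mumbai)");

    // Returns the city name, or null if the text does not match.
    static String city(String text) {
        Matcher m = SENDING_TO.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(city("sending to: GK Delhi")); // prints Delhi
    }
}
```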

Why does my regex containing \d{1,} together with a negative lookahead still match, where it shouldn't?

I'm trying to match a coordinate pair in a String using a Regex in Java. I explicitly want to exclude strings using negative lookahead.
to be matched:
558,228
558,228,
558,228,589
558,228,A,B,C
NOT to be matched:
558,228,<Text>
The Regex ^558,228(?!,<).* does the job, while ^\d{1,},\d{1,}(?!,<).* doesn't. It's the same regex with the metacharacter \d instead of values. Any ideas why?
The reason is the .* part at the end. It matches everything that wasn't matched earlier.
In combination with \d{1,}, which can backtrack and match fewer than 3 digits, it goes like this:
^\d{1,},\d{1,}(?!,<) will match 558,22 and .* will match the remaining part 8,<Text>.
The problem is the \d{1,} part in combination with the .* at the end.
In your case
558,228,<Text>
The ^\d{1,},\d{1,}(?!,<) matches "558,22" and the .* matches the rest "8,<Text>"
You can solve this using the possessive quantifier ++:
^\d+,\d++(?!,<)(.*)
See it here online on Regexr
\d++ uses the seldom-seen possessive quantifier, which is useful here. ++ means: match at least once, take as many as you can, and do not backtrack. The digits are never given back once they have been matched.
Java Quantifier tutorial
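The difference between the greedy and the possessive version can be checked with a few lines of Java. This is a sketch with invented names; the inputs are the ones from the question:

```java
import java.util.regex.Pattern;

class CoordinateCheck {
    // Greedy version from the question: \d{1,} can give back a digit,
    // so the lookahead is tested before the last digit and succeeds.
    static final Pattern GREEDY =
            Pattern.compile("^\\d{1,},\\d{1,}(?!,<).*");
    // Possessive version: \d++ keeps all digits, so the lookahead
    // is always tested directly after the complete number.
    static final Pattern POSSESSIVE =
            Pattern.compile("^\\d+,\\d++(?!,<)(.*)");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(GREEDY, "558,228,<Text>"));     // prints true
        System.out.println(matches(POSSESSIVE, "558,228,<Text>")); // prints false
        System.out.println(matches(POSSESSIVE, "558,228,A,B,C"));  // prints true
    }
}
```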
