I am struggling to understand word boundary \b in regex.
I read that there are three conditions for \b.
Before the first character in the string, if the first character is a
word character.
After the last character in the string, if the last character is a
word character.
Between two characters in the string, where one is a word character
and the other is not a word character.
I am trying to find the start index of the previous match using the java method start()
import java.util.regex.*;
class Quetico {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        System.out.print("match positions: ");
        while (m.find()) {
            System.out.print(m.start() + " ");
        }
        System.out.println();
    }
}
% java Quetico "\b" "^23 *$76 bc"
//string: ^23 *$76 bc pattern:\b
//index : 01234567890
produces: 1 3 5 6 7 9
I'm having trouble understanding why it produces this result, because I'm struggling to see the pattern. I've tried looking at the inverse, \B, which produces 0 2 4 8, but this doesn't make it any clearer for me. If you can help clarify this for me, it would be appreciated.
The issue here isn't Java, it's the Linux/Unix shell. When you put text between double quote marks on the command line, most of the special shell characters such as *, ?, etc. are no longer special, except for variable interpolation. (And some other things, like !, depending on which shell flavor you're using.) Thus, if you say
% command "this $variable is interesting"
and you've set variable to value, your command will be called with one argument: this value is interesting. In your case, the shell treats $7 as the seventh positional parameter, even though you're not in a shell script. Since it isn't set to anything, it's replaced with an empty string, and the result is the same as if you had run
% java Quetico "\b" "^23 *6 bc"
which gives me 1 3 5 6 7 9 if I use that string literal in a Java program (instead of on the command line).
To prevent $ from being interpreted by the shell, you need to use single quote marks:
% java Quetico "\b" '^23 *$76 bc'
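To see the effect without involving the shell at all, here is a minimal sketch that feeds both strings to the matcher directly from Java source. The class name BoundaryPositions and the helper positions are made up for this illustration; only Pattern and Matcher come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original program.
class BoundaryPositions {
    // Collects the start index of every match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // What the program actually received after the shell replaced $7 with nothing.
        System.out.println(positions("\\b", "^23 *6 bc"));   // [1, 3, 5, 6, 7, 9]
        // The string that was intended, i.e. what single quotes pass through.
        System.out.println(positions("\\b", "^23 *$76 bc")); // [1, 3, 6, 8, 9, 11]
    }
}
```

With the intended string, every reported index sits exactly where a word character (digit or letter) meets a non-word character or the end of the string, which is what the three rules for \b predict.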
I have input String '~|~' as the delimiter.
For example:
String s = "1~|~Vijay~|~25~|~Pune";
when I am splitting it with '~\\|~' in Java it is working fine.
String sa[] = s.split("~\\|~", -1);
for(String str : sa) {
System.out.println(str);
}
I am getting the below output.
1
Vijay
25
Pune
When I run the same program and pass the delimiter as a command-line argument ('~\\|~'), it does not split the string properly and gives the output below.
1
|
Vijay
|
25
|
Pune
Is anyone else facing the same issue? please comment on this issue.
You only need a single backslash when running it from the command line. The reason you need two when making the regular expression in Java is that backslash is used to escape the next character in a string literal or start an escape sequence so one backslash is needed to escape the next one in order for it to be interpreted literally.
~\|~
Please do a System.out.println("[" + args[i] + "]"); to see what Java is receiving from the command line, as the \ character is special to the shell, and so are the | and ~ characters (the last one expands to your home directory, which could be a problem).
You need to pass:
java foo_bar '~\|~'
(Java still needs a single \ here to escape the vertical bar. You are not writing a string literal for the Java compiler, only the string that such a literal would represent, so the \ itself doesn't need to be escaped. Because it is inside single quotes, the shell passes it to the Java program unchanged.) Any quoting (single or double quotes) is enough to prevent ~ expansion.
If you are passing
java foo_bar '~\\|~'
the shell will not treat the \ as an escape character, and the program receives the equivalent of this string literal:
String sa[] = s.split("~\\\\|~", -1); /* two escapes mean one literal backslash */
(note that the vertical bar is no longer escaped, so it acts as the alternation operator)
...which is quite different. This time the regex means: split on either a ~\ sequence (that is, a ~ followed by a backslash) or just a single ~ character. As there are no ~ characters followed by a backslash, the second alternative is used. You should get:
1
|
Vijay
|
25
|
Pune
Which is the output you post.
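The two cases can be reproduced without the shell. This is a small sketch, assuming the same input string as in the question; the class name SplitArgumentDemo and the helper split are invented for the example. Each Java literal is written so that the regex it produces is exactly the text the program would receive as args[0].

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original program.
class SplitArgumentDemo {
    // Splits the sample input from the question on the given regex.
    static String[] split(String regex) {
        return "1~|~Vijay~|~25~|~Pune".split(regex, -1);
    }

    public static void main(String[] args) {
        // The literal "~\\|~" is the regex ~\|~ (what '~\|~' delivers as args[0]).
        System.out.println(Arrays.toString(split("~\\|~")));
        // prints [1, Vijay, 25, Pune]

        // The literal "~\\\\|~" is the regex ~\\|~ (what '~\\|~' delivers as args[0]).
        System.out.println(Arrays.toString(split("~\\\\|~")));
        // prints [1, |, Vijay, |, 25, |, Pune]
    }
}
```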
You don't have to escape:
import java.util.Arrays;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(args[0], Pattern.LITERAL);
        final String[] result = p.split("1~|~Vijay~|~25~|~Pune");
        Arrays.stream(result).forEach(System.out::println);
    }
}
Running:
javac Main.java
java Main "~|~"
Output:
1
Vijay
25
Pune
Here args[0] is equal to ~|~ (no escaping). The trick is the pattern flag Pattern.LITERAL, which treats every character, including |, as a normal character and ignores its meta meaning.
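If the delimiter has to go through String.split rather than a compiled Pattern, Pattern.quote gives the same literal treatment. The following is only a sketch of that alternative; the class name LiteralSplitDemo and its two helper methods are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing two ways to split on a literal delimiter.
class LiteralSplitDemo {
    // Compiles the delimiter with Pattern.LITERAL, as in the answer above.
    static String[] splitWithLiteralFlag(String delimiter, String input) {
        return Pattern.compile(delimiter, Pattern.LITERAL).split(input);
    }

    // Quotes the delimiter so that String.split treats it as plain text.
    static String[] splitWithQuote(String delimiter, String input) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String input = "1~|~Vijay~|~25~|~Pune";
        System.out.println(String.join(",", splitWithLiteralFlag("~|~", input))); // 1,Vijay,25,Pune
        System.out.println(String.join(",", splitWithQuote("~|~", input)));       // 1,Vijay,25,Pune
    }
}
```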
I have recently started learning regex and I am not quite sure how the following regex works:
str.replaceAll("(\\w)(\\w*)", "$2$1ay");
This allows us to do the following:
input string: "Hello World !"
return string: "elloHay orldWay !"
From what I know: \w is supposed to match all word characters, including 0-9 and underscore, and $ matches at the end of the string.
In the replaceAll method, the first parameter is a regex. The method finds every part of the string that matches the regex and replaces each one with the second parameter.
In simple cases replaceAll works like this:
String str = "I,am,a,person";
str = str.replaceAll(",", " "); // I am a person
It matched all the commas and replaced them with a space.
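In your case, the match is a single word character (\w), followed by zero or more word characters (\w*). A word character is a letter, a digit, or an underscore.
The () around \w and \w* group them. So you have two groups: the first character and the remaining part. If you use regex101 or some similar website you can see a visualization of this.
Your replacement is $2 (the second group, i.e. the remaining part), followed by $1 (the first group, i.e. the first character), followed by ay. In a replacement string, $ followed by a number refers to a captured group; it is not the end-of-string anchor.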
Hope this clears it up for you.
Enclosing part of a regex in parentheses () makes it a capturing group.
Here you have 2 capturing groups: (\w) captures a single word character, and (\w*) captures zero or more.
$1 and $2 refer to the captured groups, first and second respectively.
Also, replaceAll processes each match (here, each word) individually.
So in this example, for 'Hello', 'H' is the first captured group and 'ello' is the second. The match is replaced by a reordered version, $2$1, which swaps the captured groups, with ay appended.
So '$2$1ay' gives you 'elloHay'.
The same for the next word also.
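To make the groups visible, here is a short sketch that prints what each group captures for every match and then applies the same replacement. The class name PigLatinDemo and the method toPigLatin are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing the two capturing groups at work.
class PigLatinDemo {
    // Moves the first word character of each word to the end and appends "ay".
    static String toPigLatin(String text) {
        return text.replaceAll("(\\w)(\\w*)", "$2$1ay");
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("(\\w)(\\w*)").matcher("Hello World !");
        while (m.find()) {
            // First match:  group 1 = H, group 2 = ello
            // Second match: group 1 = W, group 2 = orld
            System.out.println("group 1 = " + m.group(1) + ", group 2 = " + m.group(2));
        }
        System.out.println(toPigLatin("Hello World !")); // elloHay orldWay !
    }
}
```

The space and the ! are not word characters, so they are never part of a match and are copied to the result unchanged.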
I want to match strings enclosed in triple "-quotes which may contain line breaks, and which don't contain any """-substrings except at the very beginning and in the very end.
Valid example:
"""foo
bar "baz" blah"""
Invalid example:
"""foo bar """ baz"""
I tried using the following regex (as Java String literal):
"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""
and it seems to work on short examples. However, on longer examples, like on a string consisting of thousand lines with hello world, it gives me a StackOverflowError.
Scala snippet to reproduce the error
import java.util.regex.{Pattern, Matcher}
val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())
(note: test locally, Scastie times out; or maybe reduce 1000 to smaller number?).
Java snippet that produces the same error
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegexOverflowMain {
    public static void main(String[] args) {
        StringBuilder bldr = new StringBuilder();
        bldr.append("\"\"\"");
        for (int i = 0; i < 1000; i++) {
            bldr.append("hello world \n");
        }
        bldr.append("\"\"\"");
        String text = bldr.toString();
        Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
        System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
        System.out.println(p.matcher(text).lookingAt());
    }
}
Question
Any idea how to make this "stack safe", i.e. can someone find a regex that accepts the same language, but does not produce a StackOverflowError when fed to the Java regex API?
I don't care whether the solution is in Scala or Java (or whatever), as long the same underlying Java regex library is used.
A solution using a negative look-ahead: find a string that starts with """ and ends with """, and whose content does not include """.
As plain regex: ^"""((?!""")[\s\S])*"""$
As Java escaped regex: "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$"
[\s\S] matches any character, including line breaks (it's basically . plus line breaks, or . with the single-line flag).
This should be used without the multiline flag, so that ^ and $ match the start and end of the string and not the start and end of each line;
otherwise this:
""" ab
"""abc"""
abc """
would match
Also i used this as reference for how to exclude the """: Regular expression to match a line that doesn't contain a word?
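The behaviour described in this answer can be checked on the short examples from this thread. The sketch below is deliberately limited to short inputs and makes no claim about how the pattern behaves on very long ones; the class name LookaheadTripleQuote and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the negative look-ahead pattern.
class LookaheadTripleQuote {
    static final String REGEX = "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$";

    // Returns true if the regex finds a match anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String valid = "\"\"\"foo\nbar \"baz\" blah\"\"\"";
        String invalid = "\"\"\"foo bar \"\"\" baz\"\"\"";
        String threeLines = "\"\"\" ab\n\"\"\"abc\"\"\"\nabc \"\"\"";

        System.out.println(found(REGEX, valid));               // true
        System.out.println(found(REGEX, invalid));             // false
        System.out.println(found(REGEX, threeLines));          // false
        // With the multiline flag, the middle line matches on its own.
        System.out.println(found("(?m)" + REGEX, threeLines)); // true
    }
}
```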
The full answer below optimizes the regex performance, but to prevent stack overflow, as a simple solution, just make the repeating group possessive.
Non-possessive repeating groups with choices need recursive calls to allow backtracking. Making it possessive fixes the problem, so simply add a + after the *:
"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\""
Also note that if you want to match entire input, you need to call matches(), not lookingAt().
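As a rough check of the possessive fix, the sketch below runs the pattern with *+ on the 1000-line input from the question and uses matches() so that the whole input has to be consumed. The class name PossessiveTripleQuote and its helpers are invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the possessive version of the original pattern.
class PossessiveTripleQuote {
    static final Pattern P = Pattern.compile(
            "(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\"");

    // True only if the entire input is one triple-quoted string.
    static boolean matchesWhole(String input) {
        return P.matcher(input).matches();
    }

    // Builds the test input from the question: """ + N lines + """.
    static String longInput(int lines) {
        StringBuilder sb = new StringBuilder("\"\"\"");
        for (int i = 0; i < lines; i++) {
            sb.append("hello world \n");
        }
        return sb.append("\"\"\"").toString();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(longInput(1000)));                     // true
        System.out.println(matchesWhole("\"\"\"foo\nbar \"baz\" blah\"\"\"")); // true
        System.out.println(matchesWhole("\"\"\"foo bar \"\"\" baz\"\"\""));    // false
    }
}
```

For the invalid example, lookingAt() would still return true, because the prefix up to the second """ is a complete match; that is why matches() is used here.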
Performance boost
Note: A quick performance test showed this to be more than 6 times faster than regex in answer by x4rf41.
Instead of matching one of
Not a quote: [^\"]
Exactly one quote: (?:\"[^\"])
Exactly two quotes: (?:\"\"[^\"])
in a loop, first match everything up to a quote. If what follows is one or two quotes, but not three, match those 1-2 quotes and then everything up to the next quote, repeating as needed. Finally, match the closing triple quote.
That matching is unambiguous, so make the repeats possessive. This also prevents stack overflow when the input has many embedded quotes.
"{3}         match 3 leading quotes
[^"]*+       match as many non-quotes as possible (if any) {possessive}
(?:          start optional repeating group
   "{1,2}    match 1-2 quotes
   [^"]++    match one or more non-quotes (at least one) {possessive}
)*+          end optional repeating group {possessive}
"{3}         match 3 trailing quotes
Since you don't use ^ or $, there is no need for (?m) (MULTILINE)
As Java string:
"\"{3}[^\"]*+(?:\"{1,2}[^\"]++)*+\"{3}"
I'm trying to create a regex for numbers in which 7 must appear at least once and 9 must not appear at all.
/[^9]//d+
I'm not sure how to make it require 7 at least once.
It also fails for the following example:
123459: it accepts the string, even though it contains a 9.
However, if my string is 95, it rejects it, which is right.
Code
Method 1
(?=\d*7)(?!\d*9)\d+
Method 2
\b(?=\d*7)[0-8]+\b
Note: This method uses fewer steps (170) as opposed to Method 1 with 406 steps.
Alternatively, you can replace [0-8] with [^9\D], which basically says: don't match 9 or \D (any non-digit character).
You can also use \b(?=[^7\D]*7)[0-8]+\b, which brings the number of steps down from 170 to 147.
Method 3
\b[0-8]*7[0-8]*\b
Note: This method uses fewer steps than both methods above, at 139 steps. The only drawback of this regex is that you have to specify the valid characters in multiple places in the pattern.
Results
Input
**VALID**
123456780
7
1237412
**INVALID**
9
12345680
1234567890
12341579
Output
Note: Shown below are strings that match.
123456780
7
1237412
Explanation
Method 1
(?=\d*7) Positive lookahead ensuring what follows is any digit any number of times, followed by 7 literally
(?!\d*9) Negative lookahead ensuring what follows is not any digit any number of times, followed by 9 literally
\d+ Any digit one or more times
Method 2
\b Assert the position as a word boundary
(?=\d*7) Positive lookahead ensuring what follows is any digit any number of times, followed by 7 literally
[0-8]+ Match any character present in the set 0-8
\b Assert the position as a word boundary
Method 3
\b Assert the position as a word boundary
[0-8]* Match any digit (except 9) any number of times
7 Match the digit 7 literally
[0-8]* Match any digit (except 9) any number of times
\b Assert the position as a word boundary
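In Java, when each candidate is validated as a whole string, the core of Method 3 can be used with matches(), which already requires the entire input to match, so the \b anchors are not needed. This is a sketch under that assumption; the class name SevenNoNine and the method isValid are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical validator built on the core of Method 3.
class SevenNoNine {
    // Digits 0-8 only, with at least one 7 somewhere.
    private static final Pattern HAS_7_NO_9 = Pattern.compile("[0-8]*7[0-8]*");

    // matches() requires the whole string to match, so no anchors are needed.
    static boolean isValid(String s) {
        return HAS_7_NO_9.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "123456780", "7", "1237412",              // expected valid
            "9", "12345680", "1234567890", "12341579" // expected invalid
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```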
One way to do it would be to use several lookaheads:
(?=[^7]*7)(?!.*9)^\d+$
See a demo on regex101.com.
Note that you need to escape each backslash in a Java string literal, so that it becomes:
(?=[^7]*7)(?!.*9)^\\d+$
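This has become a bit complex, but it matches numbers that contain a 7 and no 9:
(?=.*^[0-68-9]*7[0-68-9]*$)(?=^(?:(?!9).)*$).*$
The first lookahead accepts only digits and matches exactly one occurrence of 7, so a number with two 7s, such as 77, is rejected even though the question asks for at least one. The second lookahead checks that no 9 occurs.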
Try here : https://regex101.com/r/5OHgIr/1
If I understand correctly, you need a regex that accepts all numbers that include at least one 7 and contain no 9. So try this:
(?:[0-8]*7[0-8]*)+
If you want to find such numbers only as separate tokens within normal text, add \s at the start and end of the regex.
I'm trying to add spaces around numbers, but as a result some numbers get split and others are sometimes lost.
Code:
String line = "321HELLO how do you do? $ah213 -20d1001x";
line = line.replaceAll("([^d]?)([\\d\\.]+)([^d]?)", "$1 $2 $3");
System.out.println(line);
result:
3 21 HELLO how do "you" do? $ah 213 - 2 0 d1 001 x
Rules:
No matter how big an integer is, don't split it into several parts.
$ + number ($123) or $ + letter + number ($abc123): don't add a space before or after the number.
Letter + number: separate them.
Wanted result:
321 HELLO how do "you" do? $ah213 -20 d 1001 x
One small mistake in your regex: [^d] should be [^\\d], otherwise you're checking for the character d rather than the character class \d.
But it still inserts too many spaces, I don't really see a way to avoid that with your current regex.
Something that works:
String line = "321HELLO how do you do? $ah213 -20d1001x";
line = line.replaceAll("(?<=[-\\d.])(?=[^\\s-\\d.])|(?<!\\$[a-z]{0,1000})(?<=[^\\s-\\d.])(?=[-\\d.])", " ");
System.out.println(line);
prints:
321 HELLO how do you do? $ah213 -20 d 1001 x
Explanation:
[-\\d.] is what I presume you classify as "part of a number" (although a . alone will get treated as a number, which may not be desired) (you don't need to escape . inside []).
(?<=...) is positive look-behind, meaning the previous characters match the pattern.
(?=...) is positive look-ahead, meaning the next characters match the pattern.
(?<!...) is negative look-behind, meaning the previous characters don't match the pattern.
So basically whenever you get to a place that is a switching point between number and not number, insert a space (if one doesn't already exist). And the negative look-behind prevents a space from being inserted whenever there is a $ followed by 0-1000 (can't use * in look-behind) letters (will prevent spaces with $123 and $ah123).
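The look-around version can be wrapped in a small helper to try it on other inputs. This is a sketch that reuses the regex from this answer unchanged; the class name NumberSpacer and the method space are invented for the illustration.

```java
// Hypothetical wrapper around the look-around regex from the answer.
class NumberSpacer {
    // Zero-width positions where a number part meets a non-number, non-space part.
    private static final String BOUNDARY =
            "(?<=[-\\d.])(?=[^\\s-\\d.])"                               // number -> other
          + "|(?<!\\$[a-z]{0,1000})(?<=[^\\s-\\d.])(?=[-\\d.])";        // other -> number, unless after $letters

    // Inserts one space at every matching position.
    static String space(String line) {
        return line.replaceAll(BOUNDARY, " ");
    }

    public static void main(String[] args) {
        System.out.println(space("321HELLO how do you do? $ah213 -20d1001x"));
        // prints: 321 HELLO how do you do? $ah213 -20 d 1001 x
        System.out.println(space("$123"));   // $123 (unchanged)
        System.out.println(space("abc123")); // abc 123
    }
}
```

Because every alternative is zero-width, nothing is consumed and the replacement is just a single space.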
Additional note:
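Turns out you don't really need the positive look-behinds (?<=...) at all; the preceding character can be matched and captured normally. Note that the negative look-behind is then evaluated before the captured character rather than after it, so this variant still handles the sample input but would turn $123 into $ 123: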
replaceAll("([-\\d.])(?=[^\\s-\\d.])|(?<!\\$[a-z]{0,1000})([^\\s-\\d.])(?=[-\\d.])", "$1$2 ")