java regex until certain word/text/characters

Please consider the following text :
That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it.
How can I get the following result :
That is, it matches at any position that has a non-word character to the
That is, everything up to the word "left".

input.replaceAll("^(.*?)\\bleft.*$", "$1");
^ anchors to the beginning of the string
.*? matches as little as possible of any character
\b matches a word boundary
left matches the string literal "left"
.* matches the remainder of the string
$ anchors to the end of the string
$1 replaces the matched string with group 1 in ()
If you want to use any word (not just "left"), be careful to escape it. You can use Pattern.quote(word) to escape the string.
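A minimal sketch of that idea with a word chosen at run time. The class name TextBeforeWord and the helper before are made up for illustration; the trailing \b and the DOTALL flag are additions of mine, and the sketch assumes the word starts and ends with a word character (otherwise \b will not behave as intended).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextBeforeWord {
    // Hypothetical helper: returns everything before the first whole-word
    // occurrence of `word`, or the input unchanged when the word is absent.
    static String before(String input, String word) {
        // Pattern.quote keeps any regex metacharacters in `word` literal.
        // DOTALL lets . cross line breaks so multi-line input is handled too.
        Pattern p = Pattern.compile(
                "^(.*?)\\b" + Pattern.quote(word) + "\\b.*$", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : input;
    }

    public static void main(String[] args) {
        String text = "That is, it matches at any position that has a non-word character "
                + "to the left of it, and a word character to the right of it.";
        // Brackets make the trailing space of the result visible.
        System.out.println("[" + before(text, "left") + "]");
    }
}
```

Note that the result keeps the space in front of "left"; call trim() on it if you don't want that.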

A stricter pattern would be /(.*)\Wleft\w/, but it won't match anything in
That is, it matches at any position that has a non-word character to the left of it, and a word character to the right of it.
because there "left" is followed by a space, not a word character. So use:

String result = inputString.replaceAll("(.*?)left.*", "$1");

Related

Matching whole words with special characters with a dynamically built pattern

I need to match an exact substring in a string in Java. I've tried with
String pattern = "\\b"+subItem+"\\b";
But it doesn't work if my substring contains non alphanumerical characters.
I want this to work exactly as the "Match whole word only" function in Notepad++.
Could you help?

I suggest either unambiguous word boundaries (that match a string only if the search pattern is not enclosed with letters, digits or underscores):
String pattern = "(?<!\\w)"+Pattern.quote(subItem)+"(?!\\w)";
where (?<!\w) matches a location not preceded by a word char and (?!\w) fails if there is a word char immediately after the current position (see this regex demo), or, you can use a variation that takes into account leading/trailing special chars of the potential match:
String pattern = "(?:\\B(?!\\w)|\\b(?=\\w))" + Pattern.quote(subItem) + "(?:(?<=\\w)\\b|(?<!\\w)\\B)";
See the regex demo.
Details:
(?:\B(?!\w)|\b(?=\w)) - either a non-word boundary if the next char is not a word char, or a word boundary if the next char is a word char
Data\[3\] - this is a quoted subItem
(?:(?<=\w)\b|(?<!\w)\B) - either a word boundary if the preceding char is a word char, or a non-word boundary if the preceding char is not a word char.
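Here is a small sketch of the first (lookaround) variant so you can try it against a substring that ends in a non-word character. The class name WholeWordMatch, the helper containsWhole and the sample strings are assumptions of mine, not part of the question.

```java
import java.util.regex.Pattern;

public class WholeWordMatch {
    // Hypothetical helper: true when `subItem` occurs in `text` without a
    // word character (letter, digit or underscore) directly on either side.
    static boolean containsWhole(String text, String subItem) {
        String pattern = "(?<!\\w)" + Pattern.quote(subItem) + "(?!\\w)";
        return Pattern.compile(pattern).matcher(text).find();
    }

    public static void main(String[] args) {
        String subItem = "Data[3]";
        System.out.println(containsWhole("x = Data[3] + 1", subItem));   // true
        System.out.println(containsWhole("x = MyData[3] + 1", subItem)); // false
        System.out.println(containsWhole("x = Data[3]y", subItem));      // false
    }
}
```

A plain \b after the quoted item would reject the first sample, because there is no word boundary between ] and a space.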

Regex not matching against ampersand

I'm trying to match the following regex:
\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\.?\b
In other words, a word boundary followed by any of the strings above (optionally followed by a period character) followed by a word boundary.
I'm trying to match this in Java, but the ampersand will not match. For example:
Pattern p = Pattern.compile(
"\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
Pattern.CASE_INSENSITIVE);
String result = p.matcher("mr one and mrs.two and three & four").replaceAll(" ");
System.out.println("["+result+"]");
The output of this is: [ one two three & four]
I've also tried this at regex101, and again the ampersand does not match: https://regex101.com/r/klkmwl/1
Escaping the ampersand does not make a difference, and I've tried using the hex escape sequence \x26 instead of ampersand (as suggested in this question). Why is this not matching?

Your regex will match an ampersand if it is located in between word chars, e.g. three&four, see this regex demo. This happens because \b before a non-word char requires a word char to appear immediately before it. Also, as there is a \b after an optional dot, both the dot and ampersand will only match if there is a word char immediately on the left.
You need to re-write the pattern so that the word boundaries are applied to the words rather than symbols:
Pattern p = Pattern.compile(
"(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
Pattern.CASE_INSENSITIVE);
See the regex demo online.
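To see the difference, here is a rough side-by-side of the pattern from the question and the rewritten one. The class name TitleStripper and the helper strip are illustrative; collapsing whitespace after the replacement is my addition so the two results are easy to compare.

```java
import java.util.regex.Pattern;

public class TitleStripper {
    // Pattern from the question: \b surrounds every alternative, including "&".
    static final Pattern ORIGINAL = Pattern.compile(
            "\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|&|and)\\.?\\b",
            Pattern.CASE_INSENSITIVE);

    // Rewritten pattern: \b guards only the alphabetic alternatives.
    static final Pattern FIXED = Pattern.compile(
            "(?:\\b(?:mr|mrs|ms|miss|messrs|mmes|dr|prof|rev|sr|jr|and)\\b|&)\\.?",
            Pattern.CASE_INSENSITIVE);

    // Replaces every match with a space, then collapses runs of whitespace.
    static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll(" ").trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "mr one and mrs.two and three & four";
        System.out.println(strip(ORIGINAL, input)); // the "&" survives
        System.out.println(strip(FIXED, input));    // the "&" is removed
    }
}
```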

Problem is due to use of word boundaries. There are no word boundaries before or after a non-word character like &.
In place of word boundary you can use lookarounds:
(?<!\w)(?:[jsdm]r|mr?s|miss|messrs|mmes|prof|rev|&|and)\.?(?!\w)
(?<!\w): Make sure that previous character is not a word character
(?!\w): Make sure that next character is not a word character
Note some tweaks in your regex to make it shorter.

Java Pattern regex search between strings

Given the following strings (stringToTest):
G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
And the Pattern:
Pattern p = Pattern.compile("G2:\\S+RQ:3,4");
if (p.matcher(stringToTest).find())
{
// Match
}
For string 1 I DON'T want to match, because RQ:3,4 is associated with the G3 section, not G2, and I want string 2 to match, as RQ:3,4 is associated with G2 section.
The problem with the current regex is that it's searching too far and reaching the RQ:3,4 eventually in case 1 even though I don't want to consider past the G2 section.
It's also possible that the stringToTest might be (just one section):
G2:7JAPjGdnGy8jxR8[RQ:3,4]
The strings 7JAPjGdnGy8jxR8 and jRo6pN8ZW9aglYz are variable length hashes.
Can anyone help me with the correct regex to use, to start looking at G2 for RQ:3,4 but stopping if it reaches the end of the string or -G (the start of the next section).

You may use this regex with a negative lookahead in between:
G2:(?:(?!G\d+:)\S)*RQ:3,4
RegEx Details:
G2:: Match literal text G2:
(?: Start a non-capture group
(?!G\d+:): Assert that we don't have a G<digit>: ahead of us
\S: Match a non-whitespace character
)*: End non-capture group. Match 0 or more of this
RQ:3,4: Match literal text RQ:3,4
In Java use this regex:
String re = "G2:(?:(?!G\\d+:)\\S)*RQ:3,4";

The problem is that \S matches any non-whitespace char and the regex engine parses the text from left to right. Once it finds G2: it grabs all non-whitespace chars to the right (since \S+ is a greedy subpattern) and then backtracks to find the rightmost occurrence of RQ:3,4.
In a general case, you may use
String regex = "G2:(?:(?!-G)\\S)*RQ:3,4";
See the regex demo. (?:(?!-G)\S)* is a tempered greedy token that will match 0+ occurrences of a non-whitespace char that does not start a -G substring.
If the hyphen is only possible in front of the next section, you may subtract - from \S:
String regex = "G2:[^\\s-]*RQ:3,4"; // using a negated character class
String regex = "G2:[\\S&&[^-]]*RQ:3,4"; // using character class subtraction
See this regex demo. [^\\s-]* will match 0 or more chars other than whitespace and -.
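As a quick check, the sketch below runs the greedy pattern from the question and the tempered greedy token against the three sample strings. The class name SectionMatch and the helper g2Has34 are made up for the example.

```java
import java.util.regex.Pattern;

public class SectionMatch {
    // Greedy pattern from the question: \S+ can run past the G2 section.
    static final Pattern GREEDY = Pattern.compile("G2:\\S+RQ:3,4");

    // Tempered greedy token: consume a non-whitespace char only while
    // the text at that position does not start with "-G".
    static final Pattern TEMPERED = Pattern.compile("G2:(?:(?!-G)\\S)*RQ:3,4");

    // Hypothetical helper: does the G2 section itself carry RQ:3,4?
    static boolean g2Has34(String s) {
        return TEMPERED.matcher(s).find();
    }

    public static void main(String[] args) {
        String s1 = "G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]";
        String s2 = "G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]";
        String s3 = "G2:7JAPjGdnGy8jxR8[RQ:3,4]";
        System.out.println(GREEDY.matcher(s1).find()); // true (the unwanted match)
        System.out.println(g2Has34(s1));               // false
        System.out.println(g2Has34(s2));               // true
        System.out.println(g2Has34(s3));               // true
    }
}
```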

Try to use [^\[] instead of \S in this regex: G2:[^\[]*\[RQ:3,4
[^\[] means any character but [ (in Java the [ inside a character class has to be escaped, otherwise it opens a nested class)
(considering that strings like this: G2:7JAP[jGd]nGy8[]R8[RQ:3,4] are not possible)

Find regular expression of length specified and starting and ending also specified in Java

I want to find all the words of length 3 starting with 'l' and ending with 'f'.
Here's my code:
Pattern pt = Pattern.compile("\\bl.+?f{3}\\b");
Matcher mt = pt.matcher("#Java life! Go ahead Java,lyf,fly,luf,loof");
while(mt.find()) {
System.out.println(mt.group());
}
It's showing nothing. I also tried Pattern pt = Pattern.compile("l.+?f{3}"); but I still don't get the expected output.
The output should be:
lyf luf

You can use a word boundary \b, then match for l, a word character \w and then f ending with a word boundary \b.
\bl\wf\b
Explanation
Match a word boundary \b
Match l
Match a word character \w (\w is a shorthand character, matches the ASCII characters [A-Za-z0-9_])
Match a f
Match a word boundary \b

The regex you need is
\bl\wf\b
Explanation:
Since your word must be three characters long, there can only be one character between l and f, so that's why I didn't put a quantifier there.
Your regex is wrong because
f{3} means 3 f's, not 3 characters long in total
. matches any character (except line terminators by default), including non-word characters. Use \w instead.
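Plugged into the loop from the question, the corrected pattern could look like this. The class name ThreeLetterWords and the helper find are just for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreeLetterWords {
    // Hypothetical helper: collects every three-character word
    // that starts with 'l' and ends with 'f'.
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("\\bl\\wf\\b").matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [lyf, luf]
        System.out.println(find("#Java life! Go ahead Java,lyf,fly,luf,loof"));
    }
}
```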

word boundary that rejects leading/end non-alphanumeric character

Right now I'm learning regular expressions in Java and I have a question about word boundaries. When I looked for word boundaries in Java regular expressions, I found \b, which accepts a word bordered by non-word characters, so this regex
\b123\b
will accept the string 123 456 but will reject 456123456. Now I found that a string like !$###%123^^%$# or "123" is still accepted by the regex above. Is there any word boundary/pattern that rejects a word bordered by non-alphanumeric characters (except spaces), like the examples above?

You want to use \s instead of \b. That will look for a whitespace character rather than a word boundary.
If you want your first example of 123 456 to be a match, however, then you will also need to use anchors to accept 123 at the immediate start or end of the string. This can be accomplished via (\s|^)123(\s|$). The caret ^ matches the start of the string and $ matches the end of the string.

(?<!\S)123(?!\S)
(?<!\S) matches a position that is not preceded by a non-whitespace character. (negative lookbehind)
(?!\S) matches a position that is not followed by a non-whitespace character. (negative lookahead)
I know this seems gratuitously complicated, but that's because \b conceals a lot of complexity. It's equivalent to this:
(?<=\w)(?!\w)|(?=\w)(?<!\w)
...meaning a position that's preceded by a word character and not followed by one, or a position that's followed by a word character and not preceded by one.
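A short sketch comparing \b with the whitespace lookarounds on the strings from the question. The class name StandaloneToken and the helper standalone are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class StandaloneToken {
    // Plain word boundaries, as in the question.
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\b123\\b");

    // Whitespace "boundaries": no non-whitespace char on either side.
    static final Pattern SPACE_BOUNDARY = Pattern.compile("(?<!\\S)123(?!\\S)");

    // Hypothetical helper: true when 123 is delimited only by whitespace
    // or by the start/end of the string.
    static boolean standalone(String s) {
        return SPACE_BOUNDARY.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = { "123 456", "456123456", "!$###%123^^%$#", "\"123\"" };
        for (String s : samples) {
            System.out.println(s
                    + "  \\b: " + WORD_BOUNDARY.matcher(s).find()
                    + "  lookarounds: " + standalone(s));
        }
    }
}
```

Only 123 456 passes both checks; the two punctuation-wrapped samples pass \b but are rejected by the lookarounds.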
