Space getting consumed during regex

I am using a regex to pull out words with a length of 5 that have a space before and after them, so all of the following words should match my pattern. However, after the first word matches, its trailing space has already been consumed, which makes the second word fail to match.
To illustrate, I want to get this printout:
apple orange pines dorms
Instead, I get:
apple pines
How can I handle this issue?
Code:
public static void main(String[] args) {
String myStr = " apple orange pines dorms ";
regexChecker("(\\s[A-Za-z]{5}\\s)", myStr);
}
public static void regexChecker(String regex, String strToCheckOn){
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(strToCheckOn);
while (m.find()){
if(m.group().length() != 0){
System.out.println(m.group(1));
}
System.out.println();
}
}

You need to use lookahead and lookbehind instead of consuming spaces before/after words:
(?<=\\s|^)[A-Za-z]{5,}(?=\\s|$)
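(?<=\\s|^) is a lookbehind that asserts there is a whitespace character or the start of the line before the match.
(?=\\s|$) is a lookahead that asserts there is a whitespace character or the end of the line after the match.
Because lookarounds are zero-width, the space between two words is not consumed and can act as the boundary for both of them. Note that {5,} matches five or more letters, which is why orange is included.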
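As a minimal sketch of the difference, the following compares the consuming pattern from the question with the lookaround pattern above. The class name LookaroundWords and the helper findAll are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // Hypothetical helper: collects every full match of the regex in the input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String myStr = " apple orange pines dorms ";
        // Consuming version: each match swallows the space the next word needs.
        System.out.println(findAll("\\s[A-Za-z]{5}\\s", myStr));
        // Lookaround version: the surrounding spaces are asserted, not consumed.
        System.out.println(findAll("(?<=\\s|^)[A-Za-z]{5,}(?=\\s|$)", myStr));
    }
}
```

The first call only finds " apple " and " pines " (with their spaces), while the second finds all four words without any surrounding whitespace.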

Related

regex find string between 2 characters, separated by comma

I am new to regular expressions and I want to find the strings between two characters.
I tried the code below, but it always returns false. What is wrong with it?
public static void main(String[] args) {
String input = "myFunction(hello ,world, test)";
String patternString = "\\(([^]]+)\\)";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
Input:
myFunction(hello,world,test), where myFunction can be any characters, and any characters may come before the opening (.
Output:
hello
world
test

You could make use of the \G anchor, which asserts the position at the end of the previous match, and capture your values in a group:
(?:\bmyFunction\(|\G(?!^))([^,]+)(?:\h*,\h*)?(?=[^)]*\))
In Java:
String regex = "(?:\\bmyFunction\\(|\\G(?!^))([^,]+)(?:\\h*,\\h*)?(?=[^)]*\\))";
Explanation
(?: Non capturing group
\bmyFunction\( Word boundary to prevent the match being part of a larger word, match myFunction and an opening parenthesis (
| Or
\G(?!^) Assert position at the end of previous match, not at the start of the string
) Close non capturing group
([^,]+) Capture in a group matching 1+ times not a comma
(?:\h*,\h*)? Optionally match a comma surrounded by 0+ horizontal whitespace chars
(?=[^)]*\)) Positive lookahead, assert that a closing parenthesis ) follows somewhere to the right, with no other ) before it
For example:
String patternString = "(?:\\bmyFunction\\(|\\G(?!^))([^,]+)(?:\\h*,\\h*)?(?=[^)]*\\))";
String input = "myFunction(hello ,world, test)";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Result
hello
world
test
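One detail worth checking: for the sample input, group 1 of the first match is "hello " with a trailing space, because [^,]+ also takes the whitespace in front of the comma; println simply hides it. A minimal sketch that trims each captured value follows. The class name ArgumentMatches and the method arguments are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ArgumentMatches {
    // Same pattern as in the answer above, unchanged.
    static final Pattern ARGS = Pattern.compile(
            "(?:\\bmyFunction\\(|\\G(?!^))([^,]+)(?:\\h*,\\h*)?(?=[^)]*\\))");

    // Hypothetical helper: returns the captured arguments with whitespace trimmed.
    static List<String> arguments(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ARGS.matcher(input);
        while (m.find()) {
            // group(1) may carry whitespace that preceded the comma, so trim it.
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(arguments("myFunction(hello ,world, test)"));
    }
}
```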

I'd suggest doing this in two steps:
Step 1: Capture all the content between ( and )
Use the regex: ^\S+\((.*)\)$
The first and the only capturing group will contain the required text.
Step 2: Split the captured string above on ,, thus yielding all the comma-separated parameters independently.
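The two steps above could be sketched in Java roughly as follows. The class name TwoStepSplit and the method arguments are hypothetical, and splitting on \s*,\s* (rather than on a bare comma) is an assumption made here so that the whitespace around the commas in the sample input is dropped.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepSplit {
    // Step 1 pattern from the answer: capture everything between ( and ).
    static final Pattern CALL = Pattern.compile("^\\S+\\((.*)\\)$");

    static List<String> arguments(String input) {
        Matcher m = CALL.matcher(input);
        if (!m.find()) {
            return Collections.emptyList(); // no "name(...)" shape found
        }
        String inner = m.group(1).trim();
        if (inner.isEmpty()) {
            return Collections.emptyList(); // empty argument list
        }
        // Step 2: split on commas, ignoring whitespace around them (assumption).
        return Arrays.asList(inner.split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(arguments("myFunction(hello ,world, test)"));
    }
}
```

Compared with a single regex, this keeps each step easy to read, at the cost of assuming the whole input is one call with no spaces in the function name.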

This may give you an idea (it assumes exactly three arguments with no whitespace around the commas):
([\w]+),([\w]+),([\w]+)
DEMO: https://rubular.com/r/9HDIwBTacxTy2O

Regex - How to discard match

How can I get a regular expression to discard a part of the match?
public class main {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("(?<=b)([xyz])(?:a*?)c");
String string = "abyaacbxaaac";
Matcher matcher = pattern.matcher(string);
while(matcher.find()){
System.out.println(matcher.group());
}
}
}
The output here is:
yaac
xaaac
I'd like it to output only y and x when I run System.out.println(matcher.group());
I.e., discarding what is matched by (?:a*?)
P.S.
I know I can use matcher.group(1) to get x and y on its own but I'd like the entire match to output x and y only without having to access specific groups.

You can use lookarounds in your regex to get only the part you need in match:
(?<=b)[xyz](?=a*c)
(?=a*c) is a positive lookahead to assert that we have 0 or more a followed by a c ahead. This is a zero width assertion so your match will still be one of [xyz] characters.
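A short sketch comparing the original pattern with the lookaround version on the string from the question. The class name DiscardWithLookahead and the helper findAll are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiscardWithLookahead {
    // Hypothetical helper: collects matcher.group() for every match.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "abyaacbxaaac";
        // Original pattern: the a's and the c are part of the full match.
        System.out.println(findAll("(?<=b)([xyz])(?:a*?)c", input));
        // Lookahead version: the a's and the c are only asserted.
        System.out.println(findAll("(?<=b)[xyz](?=a*c)", input));
    }
}
```

The first call yields yaac and xaaac, the second only y and x, without touching any capture group.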

JAVA split with regex doesn't work

I have the following String 46MTS007 and I have to split the numbers from the letters, so the result should be an array like {"46", "MTS", "007"}
String s = "46MTS007";
String[] spl = s.split("\\d+|\\D+");
But spl remains empty. What's wrong with the regex? I've tested it in regex101 and it works as expected (with the global flag).

If you want to use split you can use this lookaround based regex:
(?<=\d)(?=\D)|(?<=\D)(?=\d)
This splits at every position where the previous character is a digit and the next is a non-digit, or where the previous character is a non-digit and the next is a digit.
In Java:
String s = "46MTS007";
String[] spl = s.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
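To make the behaviour concrete, here is a small sketch that runs both split calls on the sample string. The class name SplitDigitsLetters and the method splitRuns are hypothetical names for this illustration.

```java
import java.util.Arrays;

class SplitDigitsLetters {
    // Splits at the zero-width boundaries between digit runs and non-digit runs.
    static String[] splitRuns(String s) {
        return s.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
    }

    public static void main(String[] args) {
        String s = "46MTS007";
        // Every character belongs to a delimiter, so only empty strings remain,
        // and split() drops trailing empty strings: the array is empty.
        System.out.println(s.split("\\d+|\\D+").length);
        // The lookaround pattern matches between characters and consumes nothing.
        System.out.println(Arrays.toString(splitRuns(s)));
    }
}
```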

The regex you're using will not split the string. split() treats the regex as the delimiter, but here the regex matches the whole string rather than a delimiter, so nothing is left over between the matches. Instead, you can use Pattern and Matcher to find the successive groups in the string.
public static void main(String[] args) {
String line = "46MTS007";
String regex = "\\D+|\\d+";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(line);
while(m.find())
System.out.println(m.group());
}
Output:
46
MTS
007
Note: Don't forget to call m.find() before reading each match, otherwise the matcher will not move on to the next one.

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir@dur:~/NetBeansProjects/regex$
thufir@dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir@dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?

You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before the "."):
[\w]+(?=\.)

If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or, if you also want to allow ,!;: and other punctuation:
\w+(?=(\p{Punct}?$))
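As a rough sketch, the last pattern can be tried against the two sentences from the question. The class name LastWord and the method lastWord are hypothetical; the capturing group inside the lookahead is dropped here because it is not needed for the match itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWord {
    // Word characters followed by optional punctuation and the end of the input.
    static final Pattern LAST = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Returns the last word, or null if the input does not end in a word.
    static String lastWord(String sentence) {
        Matcher m = LAST.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));
        System.out.println(lastWord("a b cd efg hi."));
    }
}
```

Both calls return hi, because the lookahead checks the punctuation without making it part of the match.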

You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case), while matcher.group(0) returns the full match. So your regex is almost correct. One caveat concerns your use of $, which anchors the match at the end of the line: use it only if your sentence ends exactly where the line ends!

With the regular expression (\w+)\p{Punct} you get a group count of 1: matcher.group(0) is the full match including the punctuation, and matcher.group(1) is the word without the punctuation.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and quite a few other languages), see RegexPlanet.

By using the $ anchor you will only get a match at the end of a line. So if you have multiple sentences on one line, you will not get a match for the ones in the middle.
So you should just use:
(\w+)\.
The capture group will give the correct match.
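A minimal sketch of this approach on a line with several sentences. The class name SentenceEndWords, the method wordsBeforePeriods, and the sample text are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceEndWords {
    // Returns every word that is immediately followed by a period.
    static List<String> wordsBeforePeriods(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("(\\w+)\\.").matcher(text);
        while (m.find()) {
            // group(0) would include the period; group(1) is the word alone.
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsBeforePeriods("One two. Three four. Five"));
    }
}
```

Here the trailing "Five" is skipped because no period follows it.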

I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?

Finding a Match using java.lang.String.matches()

I have a String that contains newline characters, say:
str = "Hello\n"+"Batman,\n" + "Joker\n" + "here\n"
I would like to know how to check for the existence of a particular word, say Joker, in the string str using java.lang.String.matches().
I find that str.matches(".*Joker.*") returns false, but returns true if I remove the newline characters. So what regex should be passed as the argument to str.matches()?
One way is: str.replaceAll("\\n","").matches(".*Joker.*");

The problem is that the dot in .* does not match newlines by default. If you want newlines to be matched, your regex must have the flag Pattern.DOTALL.
If you want to embed that in a regex used in .matches() the regex would be:
"(?s).*Joker.*"
However, note that this will match Jokers too. A regex does not have the notion of words. Your regex would therefore really need to be:
"(?s).*\\bJoker\\b.*"
However, a regex does not need to match all its input text (which is what .matches() does, counterintuitively), only what is needed. Therefore, this solution is even better, and does not require Pattern.DOTALL:
Pattern p = Pattern.compile("\\bJoker\\b"); // \b is the word anchor
p.matcher(str).find(); // returns true
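The three variants discussed above can be compared side by side in a short sketch. The class name JokerSearch and the method containsWord are hypothetical; Pattern.quote is used here as an extra precaution so that the searched word is treated literally.

```java
import java.util.regex.Pattern;

class JokerSearch {
    // True if the word occurs in the text as a whole word (find, not matches).
    static boolean containsWord(String text, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b")
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String str = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";
        // The dot does not match newlines by default, so the full match fails.
        System.out.println(str.matches(".*Joker.*"));
        // With the embedded DOTALL flag the dot also matches newlines.
        System.out.println(str.matches("(?s).*Joker.*"));
        // find() only needs the word itself to match somewhere in the input.
        System.out.println(containsWord(str, "Joker"));
    }
}
```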

You can do something much simpler: this is a plain contains check, so you do not need the power of regex:
public static void main(String[] args) throws Exception {
final String str = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";
System.out.println(str.contains("Joker"));
}
Alternatively you can use a Pattern and find:
public static void main(String[] args) throws Exception {
final String str = "Hello\n" + "Batman,\n" + "Joker\n" + "here\n";
final Pattern p = Pattern.compile("Joker");
final Matcher m = p.matcher(str);
if (m.find()) {
System.out.println("Found match");
}
}

You want to use a Pattern that uses the DOTALL flag, which says that a dot should also match new lines.
String str = "Hello\n"+"Batman,\n" + "Joker\n" + "here\n";
Pattern regex = Pattern.compile(".*Joker.*", Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(str);
if (regexMatcher.find()) {
// found a match
}
else
{
// no match
}
