Beginning index of every word

I want to get the beginning index of every word in a string. A word is defined as any run of non-whitespace characters.
String test = "this that and that";
Matcher matcher = Pattern.compile("\\s+[WHAT TO WRITE HERE]\\s+").matcher(test);
while (matcher.find()) {
    System.out.println(matcher.start());
}
What should I write in the regular expression? For example, the output should be 0, 5, 10, 14.
There can be multiple whitespace characters between words.

You define a word as any run of non-whitespace characters.
And there is a character class for exactly that: \S.
Your regex should therefore be:
private static final Pattern PATTERN = Pattern.compile("\\S+");
Note, however, that your definition of "word" is rather broad; it will also include punctuation etc.
As to your loop, it is correct: when you have a match, the Matcher's .start() method returns the index at which the match started.
Taking your code and modifying it a little, this gives:
String test = "this that and that";
Matcher matcher = PATTERN.matcher(test);
while (matcher.find()) {
    System.out.println(matcher.start());
}

I would use this regex:
...
Matcher matcher = Pattern.compile("[^\\s]+").matcher(test);
...

I would use:
[A-Za-z0-9]+
It will match only alphanumeric words.
I think "\S+" will be problematic with punctuation marks and unusual characters.
You can even drop the numeric ("0-9") part if you want.
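To make the trade-off between the two character classes concrete, here is a minimal sketch that collects the start indices each pattern reports. The class name `WordStarts`, the helper `starts`, and the sample text are assumptions made for illustration, not part of the answers above. It suggests that `[A-Za-z0-9]+` has its own caveat: an apostrophe splits one word into two.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordStarts {
    // Hypothetical helper: start index of every match of regex in text.
    static List<Integer> starts(String regex, String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "it's fine";
        // \S+ treats "it's" as one word.
        System.out.println(starts("\\S+", text));         // [0, 5]
        // [A-Za-z0-9]+ stops at the apostrophe, so "s" becomes its own word.
        System.out.println(starts("[A-Za-z0-9]+", text)); // [0, 3, 5]
    }
}
```

Which behaviour is preferable depends on how the input treats punctuation inside words.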

@fge already gave the best answer, but since I can't reply to his comment: @Ian McGrath, you were asking what you could have written. Other solutions exist; this is what I came up with, and it seemed to work as well.
Matcher matcher = Pattern.compile("\\w+?(\\s+|$)").matcher(test);

Related

Match starting and ending character using Java Matcher class

I want to get the words from a string that start with # and end with a space. I've tried Pattern.compile("#\\s*(\\w+)"), but it doesn't include characters like ' or :.
I want a solution that uses only pattern matching.

We can try matching using the pattern (?<=\\s|^)#\\S+, which would match any word starting with #, followed by any number of non whitespace characters.
String line = "Here is a #hashtag and here is #another has tag.";
String pattern = "(?<=\\s|^)#\\S+";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(line);
while (m.find()) {
    System.out.println(m.group(0));
}
Output:
#hashtag
#another
Note: The above solution has an edge case: it pulls in punctuation that appears at the end of a hashtag. If you don't want this, the regex can be rephrased to match only certain characters, e.g. letters and digits. But maybe this is not a concern for you.
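One way the stricter variant mentioned in the note could look is sketched below, assuming hashtags consist only of ASCII letters and digits. The class name `HashtagDemo`, the helper `hashtags`, and the sample sentence are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashtagDemo {
    // Same lookbehind as above, but the tag body is limited to letters and digits
    // (an assumption), so trailing punctuation is left out of the match.
    private static final Pattern TAG = Pattern.compile("(?<=\\s|^)#[A-Za-z0-9]+");

    static List<String> hashtags(String line) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(line);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [#hashtag, #another]
        System.out.println(hashtags("Here is a #hashtag, and here is #another."));
    }
}
```

Note that this variant would cut a tag short at characters such as ' or :, which the question wanted to keep, so the character class would need to be widened for that use case.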

The opposite of \s is \S, so you can use a regex like this:
#\s*(\S+)
Or for Java:
Pattern.compile("#\\s*(\\S+)")
It will capture anything that is not whitespace.
If you want to stop only at the space character rather than at any whitespace, change \S to [^ ].
The ^ inside the brackets negates the character class, so [^ ] matches any character except a space.
Pattern.compile("#\\s*([^ ]+)")
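The difference between the two patterns only shows up when the tag is followed by whitespace other than a space. The following sketch uses a tab for that; the class name `CaptureDemo`, the helper `firstCapture`, and the input string are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureDemo {
    // Hypothetical helper: returns capture group 1 of the first match, or null.
    static String firstCapture(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "#tag\tnext word"; // a tab follows the tag
        // \S+ stops at the tab.
        System.out.println(firstCapture("#\\s*(\\S+)", text).equals("tag"));         // true
        // [^ ]+ only stops at a space, so the tab and "next" are captured too.
        System.out.println(firstCapture("#\\s*([^ ]+)", text).equals("tag\tnext"));  // true
    }
}
```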

Java Regex Look-Behind Doesn't Work

I am working on a regex for matching phone numbers, and this is the result:
(?:(?:0{2}|\+)?([1-9][0-9]))? ?([1-9][0-9])? ?([1-9][0-9]{5})
As you can see there are spaces between the numbers. I want them to appear only when there is some other number before the space so:
"0022 45 432345" - should match
"45 345678" or "560032" - should match
" 324400" - shouldn't match because of the space in the beginning
I've been reading different tutorials about regexes and found out about look-behinds, but even a simple construction like this (just for testing):
Pattern p2 = Pattern.compile("(?<=abc)aa");
Matcher m2 = p2.matcher("abcaa");
doesn't work.
Can you tell me what's wrong?
Another problem: I want a character to be allowed only when it is THE FIRST character in the string; otherwise it shouldn't occur. So:
0043 022 234567 should not match, but 022 123450 should match.
I'm stuck right now and would appreciate any help a lot.

This should work just fine. The spaces are moved into the optional groups and are themselves optional. This way, they only match if the group before them is present, but even then they are still optional. No look-behind required.
(?:(?:(?:00|\+)?([1-9][0-9]) ?)?([1-9][0-9]) ?)?([1-9][0-9]{5})
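A quick way to check this pattern against the examples from the question is to run each input through `Matcher.matches()`, which requires the whole string to match. The sketch below does that; the class name `PhoneDemo` and the helper `isValid` are names chosen for the example.

```java
import java.util.regex.Pattern;

public class PhoneDemo {
    // The pattern from the answer above, with the backslash escaped for a Java string.
    private static final Pattern PHONE = Pattern.compile(
            "(?:(?:(?:00|\\+)?([1-9][0-9]) ?)?([1-9][0-9]) ?)?([1-9][0-9]{5})");

    // matches() succeeds only if the entire input fits the pattern.
    static boolean isValid(String input) {
        return PHONE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("0022 45 432345")); // true
        System.out.println(isValid("45 345678"));      // true
        System.out.println(isValid("560032"));         // true
        System.out.println(isValid(" 324400"));        // false: leading space
    }
}
```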

A lookbehind is a zero-length match.
According to the javadoc, the Matcher.matches method checks whether the whole String is a match.
What you're looking for is something like the Matcher.find and Matcher.group methods:
final Pattern pattern = Pattern.compile("(?<=abc)aa");
final Matcher matcher = pattern.matcher("abcaa");
final String subMatch;
if (matcher.find()) {
    subMatch = matcher.group();
} else {
    subMatch = "";
}
System.out.println(subMatch);
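The contrast between the two methods can be shown side by side. This is a small sketch, assuming the question's test used `matches()` (the question does not say which method was called); the class name `LookbehindDemo` and its helpers are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindDemo {
    private static final Pattern P = Pattern.compile("(?<=abc)aa");

    // matches() requires the pattern to cover the whole input.
    static boolean wholeMatch(String s) {
        return P.matcher(s).matches();
    }

    // find() scans for the pattern anywhere; returns the start index or -1.
    static int firstStart(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // The lookbehind consumes nothing, so the pattern itself only covers "aa".
        System.out.println(wholeMatch("abcaa")); // false
        System.out.println(firstStart("abcaa")); // 3
    }
}
```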

X? regex quantifier doesn't work as expected (by me)

Input string:
aaa---foo---ccc---ddd
aaa---bar---ccc---ddd
aaa---------ccc---ddd
Regex: aaa.*(foo|bar)?.*ccc.*(ddd)
This regex never finds the first group (foo|bar). It always returns null for capture group 1.
My question is why, and how I can avoid that.
This is a very simplified version of my regex, just for demonstration. It works if I remove the ? quantifier, but the input string can lack this group entirely (aaa---------ccc---ddd), and I still need to determine whether it is foo, bar, or null. But group 1 is always null.
Page with this regex and test strings: http://fiddle.re/45c766

Here's why it doesn't work: When you have .* in a pattern, the matcher's algorithm is to try to match as many characters as it can to make the rest of the pattern work. In this case, if it tries starting with the entire remainder of the string as .* and removing one character until it matches, it finds that (for "aaa---foo---ccc---ddd") it will work to have .* match 9 characters; then (foo|bar)? doesn't match anything, which is OK because it's optional; and the next .* matches 0 characters, and then the rest of the pattern matches. So that's the one it selects.
The reason changing .* to .*?:
aaa.*?(foo|bar)?.*?ccc.*(ddd)
doesn't work is that the matcher does the same thing in reverse. It starts with a 0-character match and then figures out if it can make the pattern work. When it tries this, it will find that it works to make .*? match 0 characters; then (foo|bar)? doesn't match anything; then the second .*? matches 9 characters; then the rest of the pattern matches ccc---ddd. So either way, it won't do what you want.
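The behaviour described above can be reproduced with a few lines. This is only a sketch; the class name `OptionalGroupDemo` and the helper `group1` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalGroupDemo {
    // Hypothetical helper: capture group 1 of the first match,
    // or the marker "<no match>" if the regex does not match at all.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : "<no match>";
    }

    public static void main(String[] args) {
        String input = "aaa---foo---ccc---ddd";
        // Greedy: the first .* swallows "---foo---", so the optional group matches nothing.
        System.out.println(group1("aaa.*(foo|bar)?.*ccc.*(ddd)", input));   // null
        // Lazy: the optional group is tried at "---", fails, and is skipped.
        System.out.println(group1("aaa.*?(foo|bar)?.*?ccc.*(ddd)", input)); // null
    }
}
```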
There are a couple solutions in the answers, both involving lookahead. Here's another solution:
aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)
This basically checks for two patterns, in order; first it checks to see if there's a pattern with foo|bar in it, and if that doesn't match, it will then search for the other possibility, without foo|bar. This will always find foo|bar if it's there.
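Here is a minimal sketch of how the alternation could be used. One detail to watch: ddd lands in group 2 when the first alternative matches and in group 3 when the second one does. The class name `AlternationDemo` and the helper `parse` are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {
    private static final Pattern P =
            Pattern.compile("aaa.*(foo|bar).*ccc.*(ddd)|aaa.*ccc.*(ddd)");

    // Returns {fooOrBar, ddd}, where fooOrBar is null if neither word is present,
    // or null if the input does not match at all.
    static String[] parse(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // Only one of groups 2 and 3 participates, depending on the alternative taken.
        String tail = m.group(2) != null ? m.group(2) : m.group(3);
        return new String[] { m.group(1), tail };
    }

    public static void main(String[] args) {
        System.out.println(parse("aaa---foo---ccc---ddd")[0]); // foo
        System.out.println(parse("aaa---bar---ccc---ddd")[0]); // bar
        System.out.println(parse("aaa---------ccc---ddd")[0]); // null
    }
}
```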
All of these solutions involve rather difficult-to-read regexes, though. This is how I might code it:
Pattern pat1 = Pattern.compile("aaa(.*)ccc.*ddd");
Pattern pat2 = Pattern.compile("foo|bar");
Matcher m1 = pat1.matcher(source); // source is the input string
String foobar = null; // stays null if the input doesn't match or has no foo/bar
if (m1.matches()) {
    Matcher m2 = pat2.matcher(m1.group(1));
    if (m2.find()) {
        foobar = m2.group(0);
    }
}
Often, attempting to use one whiz-bang regex to solve a problem results in less-readable (and possibly less-efficient) code than just breaking the problem into parts.

Change your regex to the one below if you want to capture the in-between foo or bar strings.
aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)
Because .* would also consume the in-between foo or bar, you can use (?:(?!foo|bar).)* instead. It matches any character, zero or more times, as long as foo or bar does not start at that position.
String s = "aaa---foo---ccc---ddd\n" +
        "aaa---bar---ccc---ddd\n" +
        "aaa---------ccc---ddd";
Pattern regex = Pattern.compile("aaa(?:(?!foo|bar).)*(foo|bar)?.*?ccc.*?(ddd)");
Matcher matcher = regex.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}
Output:
foo
bar
null

Try:
.{3}\-{3}(.{3})\-{3}.{3}\-{3}(.{3})

Regular expression to find substring in text

I have a text file containing some strings that I want to extract with a Java regex.
Those strings are in format of:
$numbers,numbers,numbers....,numbers##
(start with $, followed by groups of numbers plus ,, and end with ##)
Here is my pattern.
Pattern pattern = Pattern.compile("$*##");
Matcher matcher = pattern.matcher(text);
if (matcher.find())
{
}
It turns out that nothing matches my pattern.
Can anyone tell me what's wrong with it?

You need to do:
Pattern pattern = Pattern.compile("\\$\\$\\d+(,\\d+)*##$");
Thanks to @Pshemo for his valuable input in reaching the solution.
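For context on why the original pattern fails: an unescaped $ is an end-of-line anchor rather than a literal dollar sign, and * is a quantifier, not a wildcard. The sketch below assumes the format exactly as described in the question, with a single leading $ and matches that may appear anywhere in the text. That differs from the pattern above, which expects two dollar signs and anchors the match at the end of the input. The class name `DollarDemo`, the helper `extract`, and the sample text are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DollarDemo {
    // \$ is a literal dollar sign; \d+(?:,\d+)* is one or more comma-separated numbers.
    private static final Pattern ENTRY = Pattern.compile("\\$\\d+(?:,\\d+)*##");

    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [$1,22,333##, $4##]
        System.out.println(extract("x $1,22,333## y $4## z"));
    }
}
```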

String class regular expression difficulty

I want to get the first word of a string containing alphanumeric fields.
E.g. the string can be 'abc123abc' or 'abc-123abc';
I just want the first 'abc'.
Is there any way to get it without a for loop? (I want to do this using a regex, but I don't know much about regular expressions.)
Actually, the string pattern is like
[A-Za-z]{2,5}[-]{0,1}[0-9]{1,15}[A-Za-z]{0,15}
My aim is to get the first word.

Wrap the part of the expression that you would like to capture in a capturing group, and then use group(1) of the matcher to access it:
([A-Za-z]{2,5})-?[0-9]{1,15}[A-Za-z]{0,15}
The first group will capture everything up to the optional dash:
Pattern p = Pattern.compile("([A-Za-z]{2,5})-?[0-9]{1,15}[A-Za-z]{0,15}");
Matcher m = p.matcher("abc123abc");
if (m.find()) {
    System.out.println(m.group(1));
}
The above prints abc.

Try:
System.out.println("abc-123abc".split("[-\\d]+")[0]);
Output:
abc

^[A-Za-z]+
will match ASCII letters at the start of the string. Is that what you need?

You can get the matched text for ^[A-Za-z]{2,5}. This will match the leading letters of the string (between two and five of them).
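A minimal sketch of that approach with `Matcher.find()` and `Matcher.group()` follows. The class name `FirstWordDemo` and the helper `firstWord` are illustrative. It only looks at the leading letters and does not validate the rest of the format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstWordDemo {
    private static final Pattern LEADING = Pattern.compile("^[A-Za-z]{2,5}");

    // Returns the two to five leading letters, or null if the string does not start with them.
    static String firstWord(String s) {
        Matcher m = LEADING.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstWord("abc123abc"));  // abc
        System.out.println(firstWord("abc-123abc")); // abc
        System.out.println(firstWord("123abc"));     // null
    }
}
```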

String word = "abc-123abc".replaceFirst("[^a-zA-Z].*$", "");
This removes everything from the first non-letter character onward. You can also use replaceFirst with a capturing group:
String word = "abc-123abc".replaceFirst("^([a-zA-Z]+).*$", "$1");
