Counting words with regular expression "\S+" - java

Why does wordCount end up being 1, rather than 5, in the code below?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class WordCount {
public static void main(String[] args) {
final Pattern wordCountRegularExpression = Pattern.compile("\\S+");
final Matcher matcher = wordCountRegularExpression
.matcher("one two three four five");
int wordCount = 0;
while (matcher.find()) {
wordCount++;
}
System.out.println("wordCount: " + wordCount);
}
}
Doesn't the pattern "\S+" match a word, since it means one or more non-space characters?
This does work by the way:
final Pattern wordCountRegularExpression = Pattern.compile("\\b\\w+\\b");
But I still don't understand why the original code doesn't work.

Doesn't the pattern "\S+" match a word, since it means one or more non-space characters?
Yes.

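Using
import java.util.regex.*;
in Java 7, the following pattern:
Pattern.compile("\\S+");
counts words, not spaces: \S+ matches each run of non-whitespace characters. (The lowercase \s+ is the one that matches the gaps between words.)
So it should return 5 for the input "one two three four five", provided the four separators really are ordinary spaces.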

It depends on what you're using to separate the words. When I copy the code from your question into my editor, I see plain old spaces (U+0020), but when I view the page source I see non-breaking spaces (U+00A0). By default, Java's \s doesn't treat the NBSP as a whitespace character, so \S+ sees the whole string as a single run of non-whitespace characters and finds exactly one match.
Now the question is: why am I seeing NBSPs in the string literal, but nowhere else? And why are they being converted to spaces when I copy/paste? Is anyone else seeing that?
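A minimal sketch of the NBSP explanation above. The class name NbspWordCount and the helper countMatches are made up for this illustration, and the NBSP input is built with String.replace rather than taken from the original page:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NbspWordCount {
    // Counts how many times the regex finds a match in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String withSpaces = "one two three four five";
        // Same text, but with every ordinary space replaced by a non-breaking space.
        String withNbsp = withSpaces.replace(' ', '\u00A0');

        System.out.println(countMatches("\\S+", withSpaces));          // 5
        System.out.println(countMatches("\\S+", withNbsp));            // 1
        // Excluding the NBSP explicitly restores the expected count.
        System.out.println(countMatches("[^\\s\\u00A0]+", withNbsp));  // 5
    }
}
```

If the separators in the real source file are NBSPs, the explicit character class in the last call is one possible workaround; retyping the spaces in the literal is another.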

Related

Regular Expression for ")" matching parentheses

Every smiling face must have a smiling mouth that should be marked with either ) or D.
I tried to do this using the following code:
import java.util.*;
import java.util.regex.Pattern;
public class SmileFaces {
public static int countSmileys(List<String> arr) {
String regx = "/^((:|;)(-|~)?|D|//))$/";
int count=0;
ListIterator<String> itr=arr.listIterator();
while(itr.hasNext()){
if(Pattern.matches(regx,itr.next())){
count++;
}
}
return count;
}
}
I have tried this regex for smiling checking: /^((:|;)(-|~)?|D|//))$/
You could patch your current regex by escaping the closing parenthesis correctly as \\) (two backslashes in a Java string literal) instead of //), but I think character classes are easier to read here:
String regx = "^[;:][~-]?[D)]$";
Note that Java regex patterns do not take delimiters as they would in a language such as PHP or JavaScript, so I removed the surrounding slashes from your pattern. Also, methods such as String#matches and Pattern.matches always match against the whole input, so with them you could drop the ^ and $ anchors.
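As a rough sketch, the counting method from the question could look like this with the character-class pattern. The class name SmileyCounter and the sample faces are assumptions for the example; the pattern is precompiled once instead of calling Pattern.matches in the loop:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class SmileyCounter {
    // Eyes (: or ;), an optional nose (- or ~), then a smiling mouth: ) or D.
    private static final Pattern SMILEY = Pattern.compile("[;:][~-]?[D)]");

    static int countSmileys(List<String> faces) {
        int count = 0;
        for (String face : faces) {
            // matches() must consume the whole string, so no ^ or $ is needed.
            if (SMILEY.matcher(face).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSmileys(Arrays.asList(":)", ";(", ";}", ":-D"))); // 2
    }
}
```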

Java Regular Expression to check for fixed length and more

I am not even sure if regular expressions are the best way to do this. Here is the requirement on a string:
The length must be 13 characters.
The first 2 and last 2 characters are always letters.
Characters 3 to 11 are numeric.
Please suggest whether a regular expression is the best way to do it, and what the regular expression to check such a thing would look like.
Regards
Akhil
Use e.g.
"^[a-z]{2}[0-9]{9}[a-z]{2}$"
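The square brackets say which characters are allowed: a-z means a lowercase letter between a and z, and 0-9 means a digit. The number in curly braces says how many of them must be there. ^ means no characters before this, and $ means no characters after. The usage below compiles the pattern with Pattern.CASE_INSENSITIVE, so uppercase letters are accepted too.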
Usage:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class MatcherExample {
public static void main(String[] args) {
String text = "aa123456789bb";
String patternString = "^[a-z]{2}[0-9]{9}[a-z]{2}$";
Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
boolean matches = matcher.matches();
System.out.println("Matches: " + matches);
}
}
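To see how the quantifiers enforce the 13-character length on their own, here is a small sketch with a few accepted and rejected inputs. The class name CodeFormatCheck, the helper isValid, and the sample strings are invented for the illustration. The anchors are left out because Matcher.matches() already requires the whole input to match:

```java
import java.util.regex.Pattern;

class CodeFormatCheck {
    // Two letters, nine digits, two letters; letter case is ignored.
    private static final Pattern FORMAT =
            Pattern.compile("[a-z]{2}[0-9]{9}[a-z]{2}", Pattern.CASE_INSENSITIVE);

    static boolean isValid(String s) {
        return FORMAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("aa123456789bb"));   // true
        System.out.println(isValid("AB123456789cd"));   // true
        System.out.println(isValid("aa12345678bb"));    // false: only 8 digits
        System.out.println(isValid("a1123456789bb"));   // false: digit in position 2
        System.out.println(isValid("aa123456789bbb"));  // false: 14 characters
    }
}
```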

Java regex only backslash(\\) not working

I am using a pattern that contains a backslash (\), escaped once. But it is not working at all: I am getting "no match" as the result.
package com.test;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TestClassRegex {
private static final String VALIDATION = "^[0-9\\-]+$";
public static void main(String[] args) {
String line = "1234\56";
Pattern r = Pattern.compile(VALIDATION);
Matcher m = r.matcher(line);
if (m.matches()) {
System.out.println("match");
}
else {
System.out.println("no match !!");
}
}
}
How can I write a pattern that recognizes a backslash literally?
I have actually seen another post:
Java regular expression value.split("\\."), "the back slash dot" divides by character?
which doesn't answer my question completely, hence I need some pointers here.
"1234\56" will not produce "123456" but instead "1234."
Why?
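In a Java string literal, a \ followed by octal digits is an octal escape for a character. Here, \56 is read as octal 56, which is character number 46 (decimal) in the ASCII table, and that character is the period (.).
That's exactly the reason why you're not getting a match here: the input never contains a backslash, and . is not allowed by your character class.
Solution
You should first of all change your regex literal to "^[0-9\\\\-]+$". The regex engine needs \\ to match one literal backslash, and in a Java string literal each of those two backslashes has to be escaped again, which gives four. Your initial regex, "^[0-9\\-]+$", only escapes the hyphen.
Your input literal needs to be written as "1234\\56" for the same reason: \\ in the source produces a single backslash in the string.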
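A short sketch that puts both fixes together. The class name BackslashDemo and the helper isValid are placeholders for the example:

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // Regex seen by the engine: ^[0-9\\-]+$ (digits, a literal backslash, or a hyphen).
    static final Pattern VALIDATION = Pattern.compile("^[0-9\\\\-]+$");

    static boolean isValid(String line) {
        return VALIDATION.matcher(line).matches();
    }

    public static void main(String[] args) {
        // The octal escape in this literal turns into '.', so the string is really 1234.
        System.out.println("1234\56");            // 1234.
        System.out.println(isValid("1234\56"));   // false
        // A doubled backslash in the source is one backslash in the string.
        System.out.println(isValid("1234\\56"));  // true
        System.out.println(isValid("12-34"));     // true
    }
}
```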

Match String ending with (regex) java

I am following the suggestions on the page, check if string ends with certain pattern
I am trying to display a string that is
Starts with anything
Has the letters ".mp4" in it
Ends explicitly with ', (apostrophe followed by comma)
Here is my Java code:
import java.util.*;
import java.lang.*;
import java.io.*;
import java.util.regex.*;
class Ideone
{
public static void main (String[] args) throws java.lang.Exception
{
// your code goes here
String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";
Pattern p = Pattern.compile(".*.mp4[',]$");
Matcher m = p.matcher(str);
if(m.find())
System.out.println("yes");
else
System.out.println("no");
}
}
It prints "no". How should I declare my RegEx?
There are several issues in your regex:
"Has the letters .mp4 in it" means somewhere in the string, not necessarily directly in front of ',, so another .* should be inserted.
. matches any character. Use \. to match a literal dot.
[,'] is a character class, i.e. exactly one of the characters in the brackets has to occur at that position.
You can use the following regex instead:
Pattern p = Pattern.compile(".*\\.mp4.*',$");
Your character set [',] checks whether the string ends with a single ' or a single , and nothing more.
If you want to match those characters one or more times, use [',]+. However, you probably don't want to use a character set in this case, since you said order is important.
To match an apostrophe followed by comma, just use:
.*\\.mp4',$
Also, since . has special meaning, you need to escape it in '.mp4'.
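For comparison, a sketch that runs the original pattern and the two suggested ones against the string from the question. The class name Mp4SuffixDemo and the helper found are made up for the example:

```java
import java.util.regex.Pattern;

class Mp4SuffixDemo {
    // True if the regex matches somewhere in the input (same as Matcher.find()).
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String str = " _file='ANyTypEofSTR1ngHere_133444556_266545797_10798866.mp4',";

        // Original: [',] accepts only ONE character before the end, but two follow "mp4".
        System.out.println(found(".*.mp4[',]$", str));     // false
        // ".mp4" immediately followed by an apostrophe and a comma at the end.
        System.out.println(found(".*\\.mp4',$", str));     // true
        // ".mp4" anywhere, with the string ending in an apostrophe and a comma.
        System.out.println(found(".*\\.mp4.*',$", str));   // true
    }
}
```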

regex last word in a sentence ending with punctuation (period)

I'm looking for the regex pattern, not the Java code, to match the last word in an English (or European language) sentence. If the last word is, in this case, "hi" then I want to match "hi" and not "hi."
The regex (\w+)\.$ will match "hi.", whereas the output should be just "hi". What's the correct regex?
thufir#dur:~/NetBeansProjects/regex$
thufir#dur:~/NetBeansProjects/regex$ java -jar dist/regex.jar
trying
a b cd efg hi
matches:
hi
trying
a b cd efg hi.
matches:
thufir#dur:~/NetBeansProjects/regex$
code:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)$");
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
System.out.println(match);
}
}
}
My code is in Java, but that's neither here nor there. I'm strictly looking for the regex, not the Java code. (Yes, I know it's possible to strip out the last character with Java.)
What regex should I put in the pattern?
You can use a lookahead assertion. For example, to match the sentence without the period:
[\w\s]+(?=\.)
and, for just the last word (the word before "."):
[\w]+(?=\.)
If you need to have the whole match be the last word you can use lookahead.
\w+(?=(\.))
This matches a set of word characters that are followed by a period, without matching the period.
If you want the last word in the line, regardless of whether the line ends at the end of a sentence or not, you can use:
\w+(?=(\.?$))
Or if you want to also include ,!;: etc then
\w+(?=(\p{Punct}?$))
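A minimal sketch of the lookahead variant in Java, where the whole match is the last word. The class name LastWordLookahead and the sample sentences are assumptions for the example, and the extra capturing group inside the lookahead is dropped because it is not needed here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWordLookahead {
    // A word that is followed by optional punctuation and then the end of the input.
    private static final Pattern LAST_WORD = Pattern.compile("\\w+(?=\\p{Punct}?$)");

    // Returns the last word, or null when the input does not end in a word.
    static String lastWord(String sentence) {
        Matcher m = LAST_WORD.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWord("a b cd efg hi"));   // hi
        System.out.println(lastWord("a b cd efg hi."));  // hi
        System.out.println(lastWord("a b cd efg hi!"));  // hi
    }
}
```

Because the lookahead does not consume anything, matcher.group() never includes the punctuation.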
You can use matcher.group(1) to get the content of the first capturing group ((\w+) in your case). To say a little more, matcher.group(0) returns the full match. So your regex is almost correct. One possible improvement concerns your use of $, which only matches at the end of the input. Use it only if your sentence ends exactly where the input ends!
With the regular expression (\w+)\p{Punct} you get a group count of 1, which means matcher.group(0) holds the word with the punctuation and matcher.group(1) holds the word without it.
To write the regular expression in Java, use: "(\\w+)\\p{Punct}"
To test your regular expressions online with Java (and actually a lot of other languages) see RegexPlanet
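A small sketch of the capture-group approach described above. The class name LastWordGroups and the helper firstMatch are invented for the illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWordGroups {
    private static final Pattern WORD_THEN_PUNCT = Pattern.compile("(\\w+)\\p{Punct}");

    // Returns {full match, first capturing group}, or null when nothing matches.
    static String[] firstMatch(String sentence) {
        Matcher m = WORD_THEN_PUNCT.matcher(sentence);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatch("a b cd efg hi.");
        System.out.println(groups[0]);                            // hi.
        System.out.println(groups[1]);                            // hi
        // Without trailing punctuation this pattern finds nothing.
        System.out.println(firstMatch("a b cd efg hi") == null);  // true
    }
}
```

Note that this pattern requires a punctuation character after the word, so a sentence without one needs the lookahead form or an optional \p{Punct}? instead.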
By using the $ operator you will only get a match at the end of a line. So if you have multiple sentences on one line you will not get a match in the middle one.
So you should just use:
(\w+)\.
the capture group will give the correct match.
You can see an example here
I don't understand why really, but this works:
package regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String matchesLastWordFine = "a b cd efg hi";
lastWord(matchesLastWordFine);
String noMatchFound = matchesLastWordFine + ".";
lastWord(noMatchFound);
}
private static void lastWord(String sentence) {
System.out.println("\n\ntrying\n" + sentence + "\nmatches:");
Pattern pattern = Pattern.compile("(\\w+)"); //(\w+)\.
Matcher matcher = pattern.matcher(sentence);
String match = null;
while (matcher.find()) {
match = matcher.group();
}
System.out.println(match);
}
}
I guess regex \w+ will match all the words (doh). Then the last word is what I was after. Too simple, really, I was trying to exclude punctuation, but I guess regex does that automagically for you..?
