Count regex matches with streams - java

I am trying to count the number of matches of a regex pattern with a simple Java 8 lambdas/streams based solution. For example for this pattern/matcher :
final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");
There is the method splitAsStream which splits the text on the given pattern instead of matching the pattern. Although it's elegant and preserves immutability, it's not always correct :
// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();
// count is 0, wrong
final long count = pattern.splitAsStream("1").count();
I also tried (ab)using an IntStream. The problem is that I have to guess how many times to call matcher.find() instead of calling it until it returns false.
final long count = IntStream
.iterate(0, i -> matcher.find() ? 1 : 0)
.limit(100)
.sum();
I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do that with Java 8 lambdas/streams ?

To use Pattern::splitAsStream properly you have to invert your regex. That means instead of \\d+ (which splits on every number) you should use \\D+. This gives you every number in your String.
final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
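One caveat with the inverted pattern: if the input starts with a separator, the split produces an empty leading piece that gets counted too. A minimal sketch that filters empty pieces out (the class and method names are made up for illustration):

```java
import java.util.regex.Pattern;

final class SplitCount {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Counts digit runs by splitting on non-digits and ignoring empty pieces,
    // such as the empty leading piece produced when the input starts with a separator.
    static long countNumbers(String input) {
        return NON_DIGITS.splitAsStream(input)
                .filter(piece -> !piece.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(countNumbers("1,2,3,4")); // 4
        System.out.println(countNumbers(",1,2"));    // 2
        System.out.println(countNumbers("abc"));     // 0
    }
}
```

This still only works when the complement of the pattern is easy to write, as it is for \\d+ and \\D+.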

The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.
The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.
If you print out all of the matches of 1,2,3,4 you may be surprised to notice that it is actually returning the commas, not the numbers.
System.out.println("[" + pattern.splitAsStream("1,2,3,4")
.collect(Collectors.joining("!")) + "]");
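prints [!,!,!,]. The odd bit is that it gives you 4 and not 3: the match at the very start of the input yields an empty leading substring, which is why the output begins with a !.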
Obviously this also explains why "1" gives 0 because there are no strings between numbers in the string.
A quick demo:
private void test(Pattern pattern, String s) {
System.out.println(s + "-[" + pattern.splitAsStream(s)
.collect(Collectors.joining("!")) + "]");
}
public void test() {
final Pattern pattern = Pattern.compile("\\d+");
test(pattern, "1,2,3,4");
test(pattern, "a1b2c3d4e");
test(pattern, "1");
}
prints
1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]

You can extend AbstractSpliterator to solve this:
static class SpliterMatcher extends AbstractSpliterator<Integer> {
private final Matcher m;
public SpliterMatcher(Matcher m) {
super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
this.m = m;
}
@Override
public boolean tryAdvance(Consumer<? super Integer> action) {
boolean found = m.find();
if (found)
action.accept(m.groupCount());
return found;
}
}
final Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1
matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4
matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0
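If you want more than a count, the same idea can hand out a snapshot of each match. This is a sketch of a variant, not the code above: MatchStreams, matches and groups are hypothetical names, and it relies only on Matcher.toMatchResult(), which exists in Java 8.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class MatchStreams {
    // Returns a sequential stream with one immutable MatchResult per match.
    static Stream<MatchResult> matches(Pattern pattern, CharSequence input) {
        Matcher matcher = pattern.matcher(input);
        Spliterator<MatchResult> spliterator =
                new Spliterators.AbstractSpliterator<MatchResult>(
                        Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
                    @Override
                    public boolean tryAdvance(Consumer<? super MatchResult> action) {
                        if (!matcher.find()) {
                            return false; // no more matches, the stream ends here
                        }
                        action.accept(matcher.toMatchResult()); // snapshot of the current match
                        return true;
                    }
                };
        return StreamSupport.stream(spliterator, false);
    }

    // Convenience helper: the matched text of every match, in order.
    static List<String> groups(String regex, String input) {
        return matches(Pattern.compile(regex), input)
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(groups("\\d+", "a1b22c333")); // [1, 22, 333]
        System.out.println(matches(Pattern.compile("\\d+"), "1,2,3,4").count()); // 4
    }
}
```

The stream is deliberately sequential, since a Matcher is stateful and not safe to share between threads.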

In short, you have a stream of Strings and a String pattern: how many of those strings match the pattern?
final String myString = "1,2,3,4";
Long count = Arrays.stream(myString.split(","))
.filter(str -> str.matches("\\d+"))
.count();
where the first line could be any other source of strings, such as a List<String>.stream(), ...
Am I wrong ?

Java 9
You may use Matcher#results() to get hold of all matches:
Stream<MatchResult>    results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
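A minimal sketch of counting with Matcher#results(), assuming Java 9 or later (the class and method names are only for illustration):

```java
import java.util.regex.Pattern;

final class ResultsCount {
    // Counts the matches of regex in input via Matcher.results() (Java 9+).
    static long countMatches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).results().count();
    }

    public static void main(String[] args) {
        System.out.println(countMatches("\\d+", "1,2,3,4")); // 4
        System.out.println(countMatches("\\d+", "1"));       // 1
        System.out.println(countMatches("\\d+", "foobar"));  // 0
    }
}
```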
Java 8 and lower
Another simple solution based on using a reverse pattern:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
Here, all non-digits are removed from the start and end of the string, and then the string is split on non-digit sequences without reporting any empty trailing elements (since 0 is passed as the limit argument to split).
See this demo:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2
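One thing to watch with this approach: splitting an empty string returns an array with one (empty) element, so an input without any digits would be reported as 1. A sketch that guards against that case (DigitRuns and countDigitRuns are hypothetical names):

```java
final class DigitRuns {
    private static final String NON_DIGITS = "\\D+";

    // Trims non-digits from both ends, then counts the pieces left after
    // splitting on non-digit runs. An empty remainder means there were no digits.
    static int countDigitRuns(String input) {
        String trimmed = input.replaceAll("^" + NON_DIGITS + "|" + NON_DIGITS + "$", "");
        return trimmed.isEmpty() ? 0 : trimmed.split(NON_DIGITS, 0).length;
    }

    public static void main(String[] args) {
        System.out.println(countDigitRuns("xxx 1 223 zzz")); // 2
        System.out.println(countDigitRuns("hz 1"));          // 1
        System.out.println(countDigitRuns("abc"));           // 0
    }
}
```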

Related

How to Determine if a String starts with exact number of zeros?

How can I know if my string exactly starts with {n} number of leading zeros?
For example below, the conditions would return true but my real intention is to check if the string actually starts with only 2 zeros.
String str = "00063350449370";
if (str.startsWith("00")) { // true
...
}
You can do something like:
if ( str.startsWith("00") && ! str.startsWith("000") ) {
// ..
}
This will make sure that the string starts with "00", but not a longer string of zeros.
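You can try this regex, which requires two zeros that are not followed by a third (zeros later in the string are still allowed):
boolean res = s.matches("00(?!0).*");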
How about?
final String zeroes = "00";
final int zeroesLength = zeroes.length();
boolean exact = str.startsWith(zeroes) && (str.length() == zeroesLength || str.charAt(zeroesLength) != '0');
Slow but:
if (str.matches("(?s)0{3}([^0].*)?")) {
This uses (?s) DOTALL option to let . also match line-breaks.
0{3} matches exactly 3 zeros; use 0{2} for two.
How about using a regular expression?
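0{n}(?!0).*
where n is the number of leading '0's you want; the negative lookahead (?!0) makes sure the next character is not another zero, while zeros later in the string are still allowed. You can utilise the Java regex API to check if the input matches the expression:
Pattern pattern = Pattern.compile("0{2}(?!0).*"); // n = 2 here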
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
// code
}
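If n is only known at run time, the pattern can be built from it. A sketch under that assumption (LeadingZeroPattern and its methods are made-up names); it uses a negative lookahead so that zeros further along in the string do not make the match fail:

```java
import java.util.regex.Pattern;

final class LeadingZeroPattern {
    // Builds a pattern for "exactly n zeros at the start, then anything".
    // DOTALL lets the trailing .* also cover line breaks.
    static Pattern exactly(int n) {
        return Pattern.compile("0{" + n + "}(?!0).*", Pattern.DOTALL);
    }

    static boolean startsWithExactly(String input, int n) {
        return exactly(n).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsWithExactly("00063350449370", 2)); // false
        System.out.println(startsWithExactly("00063350449370", 3)); // true
        System.out.println(startsWithExactly("00123045", 2));       // true
    }
}
```

If the same n is checked many times, it is worth keeping the compiled Pattern around instead of rebuilding it per call.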
You can use a regular expression to evaluate the String value:
String str = "00063350449370";
String pattern = "[0]{2}[1-9]{1}[0-9]*"; // [0]{2}[1-9]{1} starts with 2 zeros, followed by a non-zero value, and maybe some other numbers: [0-9]*
if (Pattern.matches(pattern, str))
{
// DO SOMETHING
}
There might be a better regular expression to resolve this, but this should give you a general idea how to proceed if you choose the regular expression path.
The long way
String TestString = "0000123";
Pattern p = Pattern.compile("\\A0+(?=\\d)");
Matcher matcher = p.matcher(TestString);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(" Group: " + matcher.group());
}
You're probably better off with a small for loop though
int leadZeroes;
for (leadZeroes=0; leadZeroes<TestString.length(); leadZeroes++)
if (TestString.charAt(leadZeroes) != '0')
break;
System.out.println("Count of Leading Zeroes: " + leadZeroes);
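Wrapped into a helper, the loop answers the original question directly. A small sketch (the names are only for illustration):

```java
final class LeadingZeros {
    // Counts the '0' characters at the start of s.
    static int countLeadingZeros(String s) {
        int i = 0;
        while (i < s.length() && s.charAt(i) == '0') {
            i++;
        }
        return i;
    }

    // True when s starts with exactly n zeros, no more and no fewer.
    static boolean startsWithExactly(String s, int n) {
        return countLeadingZeros(s) == n;
    }

    public static void main(String[] args) {
        System.out.println(countLeadingZeros("0000123"));           // 4
        System.out.println(startsWithExactly("00063350449370", 2)); // false
        System.out.println(startsWithExactly("00123", 2));          // true
    }
}
```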

Java Regex finding operators

I'm trying to use regex to get numbers and operators from a string containing an expression. It finds the numbers but it doesn't find the operators. After every match (number or operator) at the beginning of the string it truncates the expression in order to find the next one.
String expression = "23*12+11";
Pattern intPattern;
Pattern opPattern;
Matcher intMatch;
Matcher opMatch;
intPattern = Pattern.compile("^\\d+");
intMatch = intPattern.matcher(expression);
opPattern = Pattern.compile("^[-+*/()]+");
opMatch = opPattern.matcher(expression);
while ( ! expression.isEmpty()) {
System.out.println("New expression: " + expression);
if (intMatch.find()) {
String inputInt = intMatch.group();
System.out.println(inputInt);
System.out.println("Found at index: " + intMatch.start());
expression = expression.substring(intMatch.end());
intMatch = intPattern.matcher(expression);
System.out.println("Truncated expression: " + expression);
} else if (opMatch.find()) {
String nextOp = opMatch.group();
System.out.println(nextOp);
System.out.println("Found at index: " + opMatch.start());
System.out.println("End index: " + opMatch.end());
expression = expression.substring(opMatch.end());
opMatch = opPattern.matcher(expression);
System.out.println("Truncated expression: " + expression);
} else {
System.out.println("Last item: " + expression);
break;
}
}
The output is
New expression: 23*12+11
23
Found at index: 0
Truncated expression: *12+11
New expression: *12+11
Last item: *12+11
As far as I have been able to investigate there is no need to escape the special characters *, + since they are inside a character class. What's the problem here?
First, your debugging output is confusing, because it's exactly the same in both branches. Add something to distinguish them, such as an a and b prefix:
System.out.println("a.Found at index: " + intMatch.start());
Your problem is that you're not resetting both matchers to the updated string. At the end of both branches in your if-else (or just once, after the entire if-else block), you need to do this:
intMatch = intPattern.matcher(expression);
opMatch = opPattern.matcher(expression);
One last thing: Since you're creating a new matcher over and over again via Pattern.matcher(s), you might want to consider creating the matcher only once, with a dummy-string, at the top of your code
//"": Unused string so matcher object can be reused
intMatch = Pattern.compile(...).matcher("");
and then resetting it in each loop iteration
intMatch.reset(expression);
You can implement the reusable Matchers like this:
//"": Unused to-search strings, so the matcher objects can be reused.
Matcher intMatch = Pattern.compile("^\\d+").matcher("");
Matcher opMatch = Pattern.compile("^[-+*/()]+").matcher("");
String expression = "23*12+11";
while ( ! expression.isEmpty()) {
System.out.println("New expression: " + expression);
intMatch.reset(expression);
opMatch.reset(expression);
if(intMatch.find()) {
...
The
Pattern *Pattern = ...
lines can be removed from the top, and the
*Match = *Pattern.matcher(expression)
lines can be removed from both if-else branches.
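Put together, the loop with two reusable matchers might look like the sketch below. It collects the tokens into a list instead of printing them, and it throws on unexpected input; both are choices made for this example, and ExpressionTokenizer is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpressionTokenizer {
    static List<String> tokenize(String expression) {
        // "": unused to-search strings, so the matcher objects can be reused.
        Matcher intMatch = Pattern.compile("^\\d+").matcher("");
        Matcher opMatch = Pattern.compile("^[-+*/()]+").matcher("");
        List<String> tokens = new ArrayList<>();
        String rest = expression;
        while (!rest.isEmpty()) {
            // Point BOTH matchers at the current remainder before searching.
            intMatch.reset(rest);
            opMatch.reset(rest);
            if (intMatch.find()) {
                tokens.add(intMatch.group());
                rest = rest.substring(intMatch.end());
            } else if (opMatch.find()) {
                tokens.add(opMatch.group());
                rest = rest.substring(opMatch.end());
            } else {
                throw new IllegalArgumentException("Unexpected input: " + rest);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("23*12+11")); // [23, *, 12, +, 11]
    }
}
```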
Your main problem is that when you find an int or an operator you reassign only intMatch or only opMatch. So after an int is found, opMatch still searches the old version of expression. You need to place these lines in both of your positive cases
intMatch = intPattern.matcher(expression);
opMatch = opPattern.matcher(expression);
But instead of using two Patterns and repeatedly truncating expression, you could use one regex that finds ints or operators and places them in different groups. I mean something like
String expression = "23*12+11";
Pattern p = Pattern.compile("(\\d+)|([-+*/()]+)");
Matcher m = p.matcher(expression);
while (m.find()){
if (m.group(1)==null){//group 1 is null so match must come from group 2
System.out.println("operator found: "+m.group(2));
}else{
System.out.println("integer found: "+m.group(1));
}
}
Also, if you don't need to handle integers and operators separately, you can just split at the positions before and after operators using look-around mechanisms
String expression = "23*12+11";
for (String s : expression.split("(?<=[-+*/()])|(?=[-+*/()])"))
System.out.println(s);
Output:
23
*
12
+
11
Try this one.
Note: you have missed the modulus operator %.
String expression = "2/3*1%(2+11)";
Pattern pt = Pattern.compile("[-+*/()%]");
Matcher mt = pt.matcher(expression);
int lastStart = 0;
while (mt.find()) {
if (lastStart != mt.start()) {
System.out.println("number:" + expression.substring(lastStart, mt.start()));
}
lastStart = mt.start() + 1;
System.out.println("operator:" + mt.group());
}
if (lastStart != expression.length()) {
System.out.println("number:" + expression.substring(lastStart));
}
output
number:2
operator:/
number:3
operator:*
number:1
operator:%
operator:(
number:2
operator:+
number:11
operator:)

Getting the "context" text of a matched group

I'm using Java's Matcher class to find some strings, and for each match I also get its begin index and end index. What I want to do now is get the x characters preceding and following each match.
So what I did was call the substring method on the string from {begin index minus x} to {end index plus x}, but it seems a little heavy: for every match, I have to go through the string again for its context.
I wanted to know whether there's a better way to do that.
Here is what I've done so far:
The part that bothers me is text.substring: how expensive is it?
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("\\d{2}").matcher(text);
int x = 5;
while (matcher.find()) {
String match = matcher.group();
int start = matcher.start();
int end = matcher.end();
String pretext = text.substring(start - x, start);
String postext = text.substring(end, end + x);
System.out.println(pretext + " - " + match + " - " + postext);
}
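Independently of the cost question, the snippet above throws a StringIndexOutOfBoundsException when a match sits within x characters of either end of the text. A sketch that clamps the bounds (MatchContext and contexts are hypothetical names, and the bracketed output format is just for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchContext {
    // For every match, returns up to x characters before it, the match in
    // brackets, and up to x characters after it.
    static List<String> contexts(String text, String regex, int x) {
        List<String> out = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            int from = Math.max(0, matcher.start() - x);          // clamp at the start
            int to = Math.min(text.length(), matcher.end() + x);  // clamp at the end
            String pre = text.substring(from, matcher.start());
            String post = text.substring(matcher.end(), to);
            out.add(pre + "[" + matcher.group() + "]" + post);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Some [22] text, with [44] char]
        System.out.println(contexts("Some 22 text with 44 characters", "\\d{2}", 5));
    }
}
```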
Suggested answer, using grouping to solve this:
use the regex (.{5})(\d{2})(.{5}).
First of all, this can't capture matches that have fewer than 5 characters before them. The solution to that is (.{0,5})(\d{2})(.{0,5}), which is fine for a simple regex like \d{2}, but for one like "c?at" and the given text "cat" the greedy context group takes the optional c, so the groups would be
c
at
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("(.{5})(\\d{2})(.{5})").matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1) + " - " + matcher.group(2) + " - " + matcher.group(3));
}
output :
Some  - 22 -  text
with  - 44 -  char

regex - disallowing a sequence of characters

I have a variable in the form of {varName} or {varName, "defaultValue"} and I want a regex that will match it. varName is only alphanumeric and "_" (\w+), and the default value can be anything except the combination of "} which signifies the end of the variable. White space doesn't matter between the braces, comma, varName or defaultValue. So far the regex that I have come up with is
\{\s*(\w+)\s*(,\s*\"([^(\"\})]*)\"\s*)?\}
The problem is that the match ends at the first " OR } and not the combination, i.e. {hello, "world"} does match but {hello, "wor"ld"} and {hello, "wor}ld"} do not.
Any idea how to solve this? In case this helps, I'm coding it using Java.
final Pattern p = Pattern.compile("\\{\\s*(\\w+)\\s*(,\\s*\"((?!\"\\}).*)\"\\s*)?\\}");
Matcher m1 = p.matcher("{hello, \"world\"}");
if (m1.matches()) {
System.out.println("var1:" + m1.group(1));
System.out.println("val1:" + m1.group(3));
}
Matcher m2 = p.matcher("{hello, \"wor}ld\"}");
if (m2.matches()) {
System.out.println("var2:" + m2.group(1));
System.out.println("val2:" + m2.group(3));
}
Matcher m3 = p.matcher("{hello, \"wor}\"ld\"}");
if (m3.matches()) {
System.out.println("var3:" + m3.group(1));
System.out.println("val3:" + m3.group(3));
}
/*output:
var1:hello
val1:world
var2:hello
val2:wor}ld
var3:hello
val3:wor}"ld */
I have found the solution to my own question, it's only fair to share. The following regex:
\{\s*(\w+)\s*(,\s*\"(.*?)\s*\"\s*)?\}
Will do the trick, stopping at the first sequence of '"}'
In Java, this will be (continuing previous answer's example):
final Pattern p = Pattern.compile("\\{\\s*(\\w+)\\s*(,\\s*\"(.*?)\\s*\"\\s*)?\\}");
Matcher m1 = p.matcher("{hello, \"world\"}");
if (m1.matches()) {
System.out.println("var1:" + m1.group(1));
System.out.println("val1:" + m1.group(3));
}
Matcher m2 = p.matcher("{hello, \"wor\"rld\"}\"}");
if (m2.matches()) {
System.out.println("var2:" + m2.group(1));
System.out.println("val2:" + m2.group(3));
}
/* Output
var1:hello
val1:world
var2:hello
val2:wor"rld"}
*/
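Another common way to say "any character, as long as the sequence "} does not start here" is to put a negative lookahead in front of every consumed character. The sketch below is an alternative to the two answers above, not a restatement of them; VariableParser and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VariableParser {
    // (?:(?!"\}).)* consumes one character at a time, but only while the
    // two-character sequence "} does not begin at the current position.
    private static final Pattern VARIABLE = Pattern.compile(
            "\\{\\s*(\\w+)\\s*(?:,\\s*\"((?:(?!\"\\}).)*)\"\\s*)?\\}");

    // True when the whole input is a single well-formed variable.
    static boolean isVariable(String input) {
        return VARIABLE.matcher(input).matches();
    }

    // Returns the default value, or null if the input is not a variable
    // or has no default value.
    static String defaultValue(String input) {
        Matcher m = VARIABLE.matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(defaultValue("{hello, \"wor\"ld\"}")); // wor"ld
        System.out.println(defaultValue("{hello, \"wor}ld\"}"));  // wor}ld
        System.out.println(isVariable("{hello, \"a\"} x\"}"));    // false
    }
}
```

Unlike the lazy .*? version, this one cannot run past the first "} even when matches() forces the whole input to be consumed, so the last input is rejected rather than accepted with a longer default value.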

Java and regular expression, substring

I am totally lost when it comes to regular expressions.
I get generated strings like:
Your number is (123,456,789)
How can I filter out 123,456,789?
You can use this regex for extracting the number including the commas
\(([\d,]*)\)
The first captured group will have your match. Code will look like this
String subjectString = "Your number is (123,456,789)";
Pattern regex = Pattern.compile("\\(([\\d,]*)\\)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
String resultString = regexMatcher.group(1);
System.out.println(resultString);
}
Explanation of the regex
"\\(" + // Match the character “(” literally
"(" + // Match the regular expression below and capture its match into backreference number 1
"[\\d,]" + // Match a single character present in the list below
// A single digit 0..9
// The character “,”
"*" + // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
")" +
"\\)" // Match the character “)” literally
This will get you started http://www.regular-expressions.info/reference.html
String str="Your number is (123,456,789)";
str = str.replaceAll(".*\\((.*)\\).*","$1");
or you can make the replacement a bit faster by doing:
str = str.replaceAll(".*\\(([\\d,]*)\\).*","$1");
try
"\\(([^)]+)\\)"
or
int start = text.indexOf('(')+1;
int end = text.indexOf(')', start);
String num = text.substring(start, end);
private void showHowToUseRegex()
{
final Pattern MY_PATTERN = Pattern.compile("Your number is \\((\\d+),(\\d+),(\\d+)\\)");
final Matcher m = MY_PATTERN.matcher("Your number is (123,456,789)");
if (m.matches()) {
Log.d("xxx", "0:" + m.group(0));
Log.d("xxx", "1:" + m.group(1));
Log.d("xxx", "2:" + m.group(2));
Log.d("xxx", "3:" + m.group(3));
}
}
You'll see that group 0 is the whole string, and the next 3 groups are your numbers.
String str = "Your number is (123,456,789)";
str = new String(str.substring(16,str.length()-1));
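Most of the snippets above assume the parentheses are always there. A sketch that makes the "not found" case explicit with an Optional (ParenExtractor and numberInParens are made-up names; the pattern is the [\d,] one from the first answer, with + instead of * so that empty parentheses do not count):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ParenExtractor {
    private static final Pattern IN_PARENS = Pattern.compile("\\(([\\d,]+)\\)");

    // Returns the digits-and-commas text inside the first pair of parentheses,
    // or an empty Optional when the input contains no such group.
    static Optional<String> numberInParens(String text) {
        Matcher m = IN_PARENS.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(numberInParens("Your number is (123,456,789)").orElse("none")); // 123,456,789
        System.out.println(numberInParens("Your number is unknown").orElse("none"));       // none
    }
}
```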
