capturing group of consecutive digits using regex - java

I'm trying to match only runs of exactly two adjacent 6's and count how many times they occur, using regex. For example, for 794234879669786694326666976 the answer should be 2, and for 66666 it should be 0. I'm using the following code: I capture with (66)* and call matcher.groupCount() to get the number of occurrences, but it's not working.
package me;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class blah {
public static void main(String[] args) {
// Define regex intended to match a pair of adjacent 6's
String regex = "(66)*";
String text = "6678793346666786784966";
// Obtain the required matcher
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
int match=0;
int groupCount = matcher.groupCount();
System.out.println("Number of group = " + groupCount);
// Find every match and print it
while (matcher.find()) {
match++;
}
System.out.println("count is "+match);
}
}

One approach here would be to use lookarounds to ensure that you match only islands of exactly two sixes:
String regex = "(?<!6)66(?!6)";
String text = "6678793346666786784966";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
This finds a count of two, for the input string you provided (the two matches being the 66 at the very start and end of the string).
The regex pattern uses two lookarounds to assert that what comes before the first 6 and after the second 6 are not other sixes:
(?<!6) assert that what precedes is NOT 6
66 match and consume two 6's
(?!6) assert that what follows is NOT 6

You need to use
String regex = "(?<!6)66(?!6)";
Details
(?<!6) - no 6 right before the current location
66 - 66 substring
(?!6) - no 6 right after the current location.
See the Java demo:
String regex = "(?<!6)66(?!6)";
String text = "6678793346666786784966";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
int match=0;
while (matcher.find()) {
match++;
}
System.out.println("count is "+match); // => count is 2
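One thing the answers above leave implicit is why the original attempt printed a misleading number: matcher.groupCount() reports how many capturing groups the pattern contains (1 for (66)*), not how many matches were found. The following is a minimal sketch that wraps the lookaround pattern in a reusable method. The class name ExactRunCounter and the method countExactRuns are made up for illustration and are not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactRunCounter {

    // Hypothetical helper: counts runs of exactly `size` consecutive copies of `target`.
    static int countExactRuns(String text, char target, int size) {
        // Quote the character so that regex metacharacters such as '.' are taken literally.
        String c = Pattern.quote(String.valueOf(target));
        // Same shape as (?<!6)66(?!6), built for an arbitrary character and run length.
        Pattern pattern = Pattern.compile("(?<!" + c + ")(?:" + c + "){" + size + "}(?!" + c + ")");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countExactRuns("794234879669786694326666976", '6', 2)); // 2
        System.out.println(countExactRuns("66666", '6', 2));                       // 0
        // groupCount() depends only on the pattern, never on the input text.
        System.out.println(Pattern.compile("(66)*").matcher("66").groupCount());   // 1
    }
}
```

Counting is done by looping over find(), exactly as in the answer above; only the pattern construction is generalized.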

This didn't take long to come up with. I like regular expressions but I don't use them unless really necessary. Here is one loop method that appears to work.
char TARGET = '6';
int GROUPSIZE = 2;
// String with random termination character that's not a TARGET
String s = "6678793346666786784966" + "z";
int consecutiveCount = 0;
int groupCount = 0;
for (char c : s.toCharArray()) {
if (c == TARGET) {
consecutiveCount++;
}
else {
// if current character is not a TARGET, update group count if
// consecutive count equals GROUPSIZE
if (consecutiveCount == GROUPSIZE) {
groupCount++;
}
// in any event, reset consecutive count
consecutiveCount = 0;
}
}
System.out.println(groupCount);
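The "z" appended above exists only to flush a run that reaches the end of the string. As a sketch of an alternative, the same loop can check the final run after the loop instead, which avoids having to pick a sentinel character. The class and method names here are hypothetical.

```java
class RunLoopCounter {

    // Hypothetical helper: counts runs of exactly `groupSize` consecutive `target` characters.
    static int countExactRuns(String s, char target, int groupSize) {
        int consecutive = 0;
        int groups = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == target) {
                consecutive++;
            } else {
                // A run just ended: count it only if it had exactly the wanted length.
                if (consecutive == groupSize) {
                    groups++;
                }
                consecutive = 0;
            }
        }
        // Handle a run that extends to the very end of the string.
        if (consecutive == groupSize) {
            groups++;
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(countExactRuns("6678793346666786784966", '6', 2)); // 2
        System.out.println(countExactRuns("66666", '6', 2));                  // 0
    }
}
```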

Related

How to replace multiple consecutive occurrences of a character with a maximum allowed number of occurrences?

CharSequence content = new StringBuffer("aaabbbccaaa");
String pattern = "([a-zA-Z])\\1\\1+";
String replace = "-";
Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = patt.matcher(content);
boolean isMatch = matcher.find();
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < content.length(); i++) {
while (matcher.find()) {
matcher.appendReplacement(buffer, replace);
}
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
In the above code, content is the input string.
I am trying to find repeated occurrences of a character in the string and want to cap them at a maximum number of occurrences.
For example:
input - ("abaaadccc", 2)
output - "abaadcc"
Here aaa and ccc are replaced by aa and cc, as the maximum allowed repetition is 2.
In the above code, I found such occurrences and tried replacing them with -, and it's working. But can someone help me with how I can get the current char and replace the run with the allowed number of occurrences?
I.e., if aaa is found, it is replaced by aa.
Or is there any alternative method without using regex?
You can add a second group to the regex and use the outer (first) group as the replacement:
String result = "aaabbbccaaa".replaceAll("(([a-zA-Z])\\2)\\2+", "$1");
Here's how it works:
( first group - a character repeated two times
([a-zA-Z]) second group - a character
\2 a character repeated once
)
\2+ a character repeated at least once more
Thus, the first group captures a replacement string.
It isn't hard to extrapolate this solution for a different maximum value of allowed repeats:
String input = "aaaaabbcccccaaa";
int maxRepeats = 4;
String pattern = String.format("(([a-zA-Z])\\2{%s})\\2+", maxRepeats-1);
String result = input.replaceAll(pattern, "$1");
System.out.println(result); //aaaabbccccaaa
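As a rough sketch, the same pattern can be wrapped in a method that takes the input and the limit, which matches the ("abaaadccc", 2) form used in the question. The names RepeatLimiter and limitRepeats are invented for this example, and like the pattern above it only considers the letters a-z and A-Z.

```java
class RepeatLimiter {

    // Hypothetical helper: trims every run of one letter down to at most `maxRepeats` copies.
    static String limitRepeats(String input, int maxRepeats) {
        if (maxRepeats < 1) {
            throw new IllegalArgumentException("maxRepeats must be at least 1");
        }
        // Group 1 holds the letter plus (maxRepeats - 1) repeats; \2+ matches the surplus.
        String pattern = String.format("(([a-zA-Z])\\2{%d})\\2+", maxRepeats - 1);
        return input.replaceAll(pattern, "$1");
    }

    public static void main(String[] args) {
        System.out.println(limitRepeats("abaaadccc", 2));       // abaadcc
        System.out.println(limitRepeats("aaaaabbcccccaaa", 4)); // aaaabbccccaaa
    }
}
```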
Since you defined a group in your regex, you can get the matching characters of this group by calling matcher.group(1). In your case it contains the first character from the repeating group so by appending it twice you get your expected result.
CharSequence content = new StringBuffer("aaabbbccaaa");
String pattern = "([a-zA-Z])\\1\\1+";
Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = patt.matcher(content);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
System.out.println("found : "+matcher.start()+","+matcher.end()+":"+matcher.group(1));
matcher.appendReplacement(buffer, matcher.group(1)+matcher.group(1));
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
Output:
found : 0,3:a
found : 3,6:b
found : 8,11:a
aabbccaa
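The question also asks for an alternative without regex. One possible sketch is a single pass that tracks the length of the current run and drops characters beyond the limit. The names below are hypothetical. Note the differences from the regex versions: this loop applies to every character, not only letters, and it compares characters case-sensitively.

```java
class RepeatLimiterLoop {

    // Hypothetical helper: keeps at most `maxRepeats` consecutive copies of any character.
    static String limitRepeats(String input, int maxRepeats) {
        StringBuilder out = new StringBuilder();
        int run = 0; // length of the current run, including the current character
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Extend the run if the character repeats, otherwise start a new run.
            run = (i > 0 && c == input.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRepeats) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(limitRepeats("abaaadccc", 2));   // abaadcc
        System.out.println(limitRepeats("aaabbbccaaa", 2)); // aabbccaa
    }
}
```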

Find ALL matches of a regex pattern in Java - even overlapping ones [duplicate]

I have a String of the form:
1,2,3,4,5,6,7,8,...
I am trying to find all substrings in this string that contain exactly 4 digits. For this I have the regex [0-9],[0-9],[0-9],[0-9]. Unfortunately when I try to match the regex against my String, I never obtain all the substrings, only a part of all the possible substrings. For instance, in the example above I would only get:
1,2,3,4
5,6,7,8
although I expect to get:
1,2,3,4
2,3,4,5
3,4,5,6
...
How would I go about finding all matches corresponding to my regex?
for info, I am using Pattern and Matcher to find the matches:
Pattern pattern = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
Matcher matcher = pattern.matcher(myString);
List<String> matches = new ArrayList<String>();
while (matcher.find())
{
matches.add(matcher.group());
}
By default, successive calls to Matcher.find() start at the end of the previous match.
To search from a specific location, pass find() a start position that is one character past the start of the previous match.
In your case probably something like:
while (matcher.find(matcher.start()+1))
This works fine:
Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
public void test(String[] args) throws Exception {
String test = "0,1,2,3,4,5,6,7,8,9";
Matcher m = p.matcher(test);
if(m.find()) {
do {
System.out.println(m.group());
} while(m.find(m.start()+1));
}
}
printing
0,1,2,3
1,2,3,4
...
If you are looking for a pure regex based solution then you may use this lookahead based regex for overlapping matches:
(?=((?:[0-9],){3}[0-9]))
Note that your matches are available in captured group #1
Code:
final String regex = "(?=((?:[0-9],){3}[0-9]))";
final String string = "0,1,2,3,4,5,6,7,8,9";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
0,1,2,3
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
Some sample code without regex (since regex does not seem useful to me here). I would also assume regex to be slower in this case. However, as written it only works as long as each element is exactly one character long.
String s = "a,b,c,d,e,f,g,h";
// Each window of four elements spans 7 characters; a step of 2 skips one element and its comma
for (int i = 0; i + 7 <= s.length(); i += 2) {
System.out.println(s.substring(i, i + 7));
}
Output for this string:
a,b,c,d
b,c,d,e
c,d,e,f
d,e,f,g
e,f,g,h
As @OldCurmudgeon pointed out, find() by default starts looking from the end of the previous match. To position it right after the first matched element, make the first matched element a capturing group and use its end index:
Pattern pattern = Pattern.compile("(\\d,)\\d,\\d,\\d");
Matcher matcher = pattern.matcher("1,2,3,4,5,6,7,8,9");
List<String> matches = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
start = matcher.end(1);
matches.add(matcher.group());
}
System.out.println(matches);
results in
[1,2,3,4, 2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9]
This approach would also work if each element is longer than one digit, provided the pattern uses \\d+ instead of \\d.
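To illustrate the multi-digit case, here is a small sketch of the same loop with \\d+ in place of \\d. The class name SlidingWindows and the method windowsOfFour are made up for the example, and it assumes the input is a plain comma-separated list of numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlidingWindows {

    // Hypothetical helper: returns every window of four consecutive numbers.
    static List<String> windowsOfFour(String csv) {
        // Group 1 is the first number plus its comma; its end marks where the next window starts.
        Pattern pattern = Pattern.compile("(\\d+,)\\d+,\\d+,\\d+");
        Matcher matcher = pattern.matcher(csv);
        List<String> matches = new ArrayList<>();
        int start = 0;
        while (matcher.find(start)) {
            start = matcher.end(1);
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Prints [10,20,30,40, 20,30,40,50, 30,40,50,60]
        System.out.println(windowsOfFour("10,20,30,40,50,60"));
    }
}
```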

Java Pattern REGEX Where Not Matching

I'm trying to use the Java Pattern and Matcher to apply input checks. I have it working in a really basic format which I am happy with so far. It applies a REGEX to an argument and then loops through the matching characters.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexUtil {
public static void main(String[] args) {
String argument;
Pattern pattern;
Matcher matcher;
argument = "#a1^b2";
pattern = Pattern.compile("[a-zA-Z]|[0-9]|\\s");
matcher = pattern.matcher(argument);
// find all matching characters
while(matcher.find()) {
System.out.println(matcher.group());
}
}
}
This is fine for extracting all the good characters, I get the output
a
1
b
2
Now I wanted to know if it's possible to do the same for any characters that don't match the REGEX so I get the output
#
^
Or better yet loop through it and get TRUE or FALSE flags for each index of the argument
false
true
true
false
true
true
I only know how to loop through with matcher.find(), any help would be greatly appreciated
You may add a |(.) alternative to your pattern (to match any char but a line break char) and check if Group 1 matched upon each match. If yes, output false, else, output true:
String argument = "#a1^b2";
Pattern pattern = Pattern.compile("[a-zA-Z]|[0-9]|\\s|(.)"); // or "[a-zA-Z0-9\\s]|(.)"
Matcher matcher = pattern.matcher(argument);
while(matcher.find()) { // find all matching characters
System.out.println(matcher.group(1) == null);
}
See the Java demo, output:
false
true
true
false
true
true
Note you do not need to use a Pattern.DOTALL here, because \s in your "whitelist" part of the pattern matches line breaks.
Why not simply remove all matching chars from your string, so that only the non-matching ones remain:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class RegexUtil {
public static void main(String[] args) {
String argument;
Pattern pattern;
Matcher matcher;
argument = "#a1^b2";
pattern = Pattern.compile("[a-zA-Z]|[0-9]|\\s");
matcher = pattern.matcher(argument);
// find all matching characters
while(matcher.find()) {
System.out.println(matcher.group());
argument = argument.replace(matcher.group(), "");
}
System.out.println("argument: " + argument);
}
}
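If only the leftover characters are needed, the loop above can probably be collapsed into a single replaceAll call. This is a sketch under that assumption: the three alternatives are merged into one character class, and the class and method names are hypothetical.

```java
class UnmatchedChars {

    // Hypothetical helper: strips every allowed character and returns what is left.
    static String unmatched(String argument) {
        // [a-zA-Z0-9\s] is equivalent to [a-zA-Z]|[0-9]|\s for single characters.
        return argument.replaceAll("[a-zA-Z0-9\\s]", "");
    }

    public static void main(String[] args) {
        System.out.println(unmatched("#a1^b2")); // #^
    }
}
```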
You have to iterate over each char of the String and check one by one :
//for-each loop, shorter way
for (char c : argument.toCharArray()){
System.out.println(pattern.matcher(c + "").matches());
}
or
//classic for-i loop, with details
for (int i = 0; i < argument.length(); i++) {
String charAtI = argument.charAt(i) + "";
boolean doesMatch = pattern.matcher(charAtI).matches();
System.out.println(doesMatch);
}
Also, when you don't need them to be separate, you can declare a variable and assign its value in one statement:
String argument = "#a1^b2";
Pattern pattern = Pattern.compile("[a-zA-Z]|[0-9]|\\s");
Track positions, and for each match, print the characters between last match and current match.
int pos = 0;
while (matcher.find()) {
for (int i = pos; i < matcher.start(); i++) {
System.out.println(argument.charAt(i));
}
pos = matcher.end();
}
// Print any trailing characters after last match.
for (int i = pos; i < argument.length(); i++) {
System.out.println(argument.charAt(i));
}
One solution is
String argument;
Pattern pattern;
Matcher matcher;
argument = "#a1^b2";
List<String> charList = Arrays.asList(argument.split(""));
pattern = Pattern.compile("[a-zA-Z]|[0-9]|\\s");
matcher = pattern.matcher(argument);
ArrayList<String> matchedCharList = new ArrayList<>();
// collect every matching character
while(matcher.find()) {
matchedCharList.add(matcher.group());
}
// print true for each character that matched, false otherwise
for(String charr : charList)
{
System.out.println(matchedCharList.contains(charr));
}
Output
false
true
true
false
true
true

Issue with finding indices of multiple matches in String with regex

I'm attempting to find the indices of multiple matches in a String using Regex (test code below), for use with external libraries.
static String content = "a {non} b {1} c {1}";
static String inline = "\\{[0-9]\\}";
public static void getMatchIndices()
{
Pattern pattern = Pattern.compile(inline);
Matcher matcher = pattern.matcher(content);
while (matcher.find())
{
System.out.println(matcher.group());
Integer i = content.indexOf(matcher.group());
System.out.println(i);
}
}
OUTPUT:
{1}
10
{1}
10
It finds both groups, but returns an index of 10 for both. Any ideas?
From http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#indexOf(java.lang.String):
Returns the index within this string of the first occurrence of the specified substring.
Since both match the same thing ('{1}') the first occurrence is returned in both cases.
You probably want to use Matcher#start() to determine the start of your match.
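A minimal sketch of that suggestion, collecting matcher.start() for each match instead of calling indexOf. The class name MatchIndices and the method matchStarts are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchIndices {

    // Hypothetical helper: returns the start index of every match of `regex` in `content`.
    static List<Integer> matchStarts(String content, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(content);
        List<Integer> starts = new ArrayList<>();
        while (matcher.find()) {
            // start() is the position of the current match, not of the first equal substring.
            starts.add(matcher.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("a {non} b {1} c {1}", "\\{[0-9]\\}")); // [10, 16]
    }
}
```

matcher.end() can be collected the same way if the library in question needs both ends of each match.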
You can do this with regexp. The following will find the locations in the string.
static String content = "a {non} b {1} c {1}";
static String inline = "\\{[0-9]\\}";
public static void getMatchIndices()
{
Pattern pattern = Pattern.compile(inline);
Matcher matcher = pattern.matcher(content);
int pos = 0;
while (matcher.find(pos)) {
int found = matcher.start();
System.out.println(found);
pos = found +1;
}
}

How to determine where a regex failed to match using Java APIs

I have tests where I validate the output with a regex. When it fails it reports that output X did not match regex Y.
I would like to add some indication of where in the string the match failed. E.g. what is the farthest the matcher got in the string before backtracking. Matcher.hitEnd() is one case of what I'm looking for, but I want something more general.
Is this possible to do?
If a match fails, then Matcher.hitEnd() tells you whether a longer string could have matched. In addition, you can specify a region in the input sequence that will be searched to find a match. So if you have a string that cannot be matched, you can test its prefixes to see where the match fails:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class LastMatch {
private static int indexOfLastMatch(Pattern pattern, String input) {
Matcher matcher = pattern.matcher(input);
for (int i = input.length(); i > 0; --i) {
Matcher region = matcher.region(0, i);
if (region.matches() || region.hitEnd()) {
return i;
}
}
return 0;
}
public static void main(String[] args) {
Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
String[] samples = {
"*ABC",
"A1b*",
"AB12uv",
"AB12uv*",
"ABCDabc",
"ABC123X"
};
for (String sample : samples) {
int lastMatch = indexOfLastMatch(pattern, sample);
System.out.println(sample + ": last match at " + lastMatch);
}
}
}
The output of this class is:
*ABC: last match at 0
A1b*: last match at 3
AB12uv: last match at 6
AB12uv*: last match at 6
ABCDabc: last match at 4
ABC123X: last match at 6
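To see what the loop above relies on, it may help to look at matches() and hitEnd() in isolation. The sketch below uses a hypothetical probe helper; the two inputs are taken from the samples above, and the commented results are the ones implied by the output listed there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HitEndDemo {

    // Hypothetical helper: returns { matches(), hitEnd() } for the whole input.
    static boolean[] probe(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        boolean matched = m.matches();
        // hitEnd() refers to the most recent match operation, here the matches() call.
        return new boolean[] { matched, m.hitEnd() };
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
        // "ABCD" ran out of input while still matching: more input could complete it.
        boolean[] a = probe(pattern, "ABCD");
        System.out.println(a[0] + " " + a[1]); // false true
        // "A1b*" failed at '*' before reaching the end: no suffix can repair it.
        boolean[] b = probe(pattern, "A1b*");
        System.out.println(b[0] + " " + b[1]); // false false
    }
}
```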
You can take the string, and iterate over it, removing one more char from its end at every iteration, and then check for hitEnd():
int farthestPoint(Pattern pattern, String input) {
for (int i = input.length() - 1; i > 0; i--) {
Matcher matcher = pattern.matcher(input.substring(0, i));
if (!matcher.matches() && matcher.hitEnd()) {
return i;
}
}
return 0;
}
You could use a pair of replaceAll() calls to indicate the positive and negative matches of the input string. Let's say, for example, you want to validate a hex string; the following will indicate the valid and invalid characters of the input string.
String regex = "[0-9A-F]";
String input = "J900ZZAAFZ99X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
// Mark valid characters with '+', then everything else with '-'
String mask = m.replaceAll("+").replaceAll("[^+]", "-");
System.out.println(input);
System.out.println(mask);
This would print the following, with a + under valid characters and a - under invalid characters.
J900ZZAAFZ99X
-+++--+++-++-
If you want to do it outside of the code, I use Rubular to test regular expressions before putting them in the code.
