Regex for detecting repeating symbols - java

I'm looking for the regex expression that will detect repeating symbols in a String. And currently I didn't found solution that fits all my requirements.
Requirements are pretty simple:
detect any repeating symbol in a String;
to be able to setup repeating count (eg. more than twice)
Examples of required detection (of symbol 'a', more than 2 times, true if detects, false otherwise)
"Abcdefg" - false
"AbcdaBCD" - false
"abcd_ab_ab" - true (symbol 'a' used three times)
"aabbaabb" - true (symbols 'a' used four times)
Since I'm not a pro in regex and usage of them - code snippet and explanation would be appreciated!
Thanks!

I think that
(.).*\1
would work:
(.) match a single character and capture
.* match any intervening characters
\1 match the captured group again.
(You'd need to compile with the DOTALL flag, or replace . with [\s\S] or similar if the string contains characters not ordinarily matched by .)
and if you want to require that it is found at least 3 times, just change the quantifier of the second two bullets:
(.)(.*\1){2}
etc.
This is going to be pretty inefficient, though, because it's going to have to do the "search for the next matching character" between every character in the string and the end of the string, making it at least quadratic.
You might be as well off not using regular expressions, e.g.
char[] cs = str.toCharArray();
Arrays.sort(cs);
int n = numOccurrencesRequired - 1;
for (int i = n; i < cs.length; ++i) {
boolean allSame = true;
for (int j = 1; j <= n && allSame; ++j) {
allSame = cs[i] == cs[i - j];
}
if (allSame) return true;
}
return false;
This sorts all of the same characters together, allowing you just to pass over the string once looking for adjacent equal characters.
Note that this doesn't quite work for any symbol: it will split up multi-char codepoints like 🍕. You can adapt the code above to work with codepoints, rather than chars.

Try this regex: (.)(?:.*\1)
It basically matches any character (.) is followed by anything .* and itself \1. If you want to check for 2 or more repeats only add {n,} at the end with n being the number of repeats you want to check for.

Yea, such regex exists but just because the set of characters is finite.
regex: .*(a.*a|b.*b|c.*c|...|y.*y|z.*z).*
It makes no sense. Use another approach:
String string = "something";
int[] count = new int[256];
for (int i = 0; i < string.length; i++) {
int temp = int(string.charAt(i));
count[temp]++;
}
Now you have all characters counted and you can use them as you wish.

Related

trouble with writing regex java

String always consists of two distinct alternating characters. For example, if string 's two distinct characters are x and y, then t could be xyxyx or yxyxy but not xxyy or xyyx.
But a.matches() always returns false and output becomes 0. Help me understand what's wrong here.
public static int check(String a) {
char on = a.charAt(0);
char to = a.charAt(1);
if(on != to) {
if(a.matches("["+on+"("+to+""+on+")*]|["+to+"("+on+""+to+")*]")) {
return a.length();
}
}
return 0;
}
Use regex (.)(.)(?:\1\2)*\1?.
(.) Match any character, and capture it as group 1
(.) Match any character, and capture it as group 2
\1 Match the same characters as was captured in group 1
\2 Match the same characters as was captured in group 2
(?:\1\2)* Match 0 or more pairs of group 1+2
\1? Optionally match a dangling group 1
Input must be at least two characters long. Empty string and one-character string will not match.
As java code, that would be:
if (a.matches("(.)(.)(?:\\1\\2)*\\1?")) {
See regex101.com for working examples1.
1) Note that regex101 requires use of ^ and $, which are implied by the matches() method. It also requires use of flags g and m to showcase multiple examples at the same time.
UPDATE
As pointed out by Austin Anderson:
fails on yyyyyyyyy or xxxxxx
To prevent that, we can add a zero-width negative lookahead, to ensure input doesn't start with two of the same character:
(?!(.)\1)(.)(.)(?:\2\3)*\2?
See regex101.com.
Or you can use Austin Anderson's simpler version:
(.)(?!\1)(.)(?:\1\2)*\1?
Actually your regex is almost correct but problem is that you have enclosed your regex in 2 character classes and you need to match an optional 2nd character in the end.
You just need to use this regex:
public static int check(String a) {
if (a.length() < 2)
return 0;
char on = a.charAt(0);
char to = a.charAt(1);
if(on != to) {
String re = on+"("+to+on+")*"+to+"?|"+to+"("+on+to+")*"+on+"?";
System.out.println("re: " + re);
if(a.matches(re)) {
return a.length();
}
}
return 0;
}
Code Demo

Counting comma and any text in java String

I'm trying to write a function to count specific Strings.
The Strings to count look like the following:
first any character except comma at least once -
the comma -
any chracter but at least once
example string:
test, test, test,
should count to 3
I've tried do that by doing the following:
int countSubstrings = 0;
final Pattern pattern = Pattern.compile("[^,]*,.+");
final Matcher matcher = pattern.matcher(commaString);
while (matcher.find()) {
countSubstrings++;
}
Though my solution doesn't work. It always ends up counting to one and no further.
Try this pattern instead: [^,]+
As you can see in the API, find() will give you the next subsequence that matches the pattern. So this will find your sequences of "non-commas" one after the other.
Your regex, especially the .+ part will match any char sequence of at least length 1. You want the match to be reluctant/lazy so add a ?: [^,]*,.+?
Note that .+? will still match a comma that directly follows a comma so you might want to replace .+? with [^,]+ instead (since commas can't match with this lazyness is not needed).
Besides that an easier solution might be to split the string and get the length of the array (or loop and check the elements if you don't want to allow for empty strings):
countSubstrings = commaString.split(",").length;
Edit:
Since you added an example that clarifies your expectations, you need to adjust your regex. You seem to want to count the number of strings followed by a comma so your regex can be simplified to [^,]+,. This matches any char sequence consisting of non-comma chars which is followed by a comma.
Note that this wouldn't match multiple commas or text at the end of the input, e.g. test,,test would result in a count of 1. If you have that requirement you need to adjust your regex.
So, quite good answers are already given. Very readable. Something like this should work, beware, it's not clean and probably not the fastest way to do this. But is is quite readable. :)
public int countComma(String lots_of_words) {
int count = 0;
for (int x = 0; x < lots_of_words.length(); x++) {
if (lots_of_words.charAt(x) == ',') {
count++;
}
}
return count;
}
Or even better:
public int countChar(String lots_of_words, char the_chosen_char) {
int count = 0;
for (int x = 0; x < lots_of_words.length(); x++) {
if (lots_of_words.charAt(x) == the_chosen_char) {
count++;
}
}
return count;
}

Java Regex hung on a long string

I am trying to write a REGEX to validate a string. It should validate to the requirement which is that it should have only Uppercase and lowercase English letters (a to z, A to Z) (ASCII: 65 to 90, 97 to 122) AND/OR Digits 0 to 9 (ASCII: 48 to 57) AND Characters - _ ~ (ASCII: 45, 95, 126). Provided that they are not the first or last character. It can also have Character. (dot, period, full stop) (ASCII: 46) Provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively. I have tried using the following
Pattern.compile("^[^\\W_*]+((\\.?[\\w\\~-]+)*\\.?[^\\W_*])*$");
It works fine for smaller strings but it doesn't for long strings as i am experiencing thread hung issues and huge spikes in cpu. Please help.
Test cases for invalid strings:
"aB78."
"aB78..ab"
"aB78,1"
"aB78 abc"
".Abc12"
Test cases for valid strings:
"abc-def"
"a1b2c~3"
"012_345"
Your regex suffers from catastrophic backtracking, which leads to O(2n) (ie exponential) solution time.
Although following the link will provide a far more thorough explanation, briefly the problem is that when the input doesn't match, the engine backtracks the first * term to try different combinations of the quantitys of the terms, but because all groups more or less match the same thing, the number of combinations of ways to group grows exponentially with the length of the backtracking - which in the case of non- matching input is the entire input.
The solution is to rewrite the regex so it won't catastrophically backtrack:
don't use groups of groups
use possessive quantifiers eg .*+ (which never backtrack)
fail early on non-match (eg using an anchored negative look ahead)
limit the number of times terms may appear using {n,m} style quantifiers
Or otherwise mitigate the problem
Problem
It is due to catastrophic backtracking. Let me show where it happens, by simplifying the regex to a regex which matches a subset of the original regex:
^[^\W_*]+((\.?[\w\~-]+)*\.?[^\W_*])*$
Since [^\W_*] and [\w\~-] can match [a-z], let us replace them with [a-z]:
^[a-z]+((\.?[a-z]+)*\.?[a-z])*$
Since \.? are optional, let us remove them:
^[a-z]+(([a-z]+)*[a-z])*$
You can see ([a-z]+)*, which is the classical example of regex which causes catastrophic backtracking (A*)*, and the fact that the outermost repetition (([a-z]+)*[a-z])* can expand to ([a-z]+)*[a-z]([a-z]+)*[a-z]([a-z]+)*[a-z] further exacerbate the problem (imagine the number of permutation to split the input string to match all expansions that your regex can have). And this is not mentioning [a-z]+ in front, which adds insult to injury, since it is of the form A*A*.
Solution
You can use this regex to validate the string according to your conditions:
^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$
As Java string literal:
"^(?=[a-zA-Z0-9])[a-zA-Z0-9_~-]++(\\.[a-zA-Z0-9_~-]++)*+(?<=[a-zA-Z0-9])$"
Breakdown of the regex:
^ # Assert beginning of the string
(?=[a-zA-Z0-9]) # Must start with alphanumeric, no special
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
(?<=[a-zA-Z0-9]) # Must end with alphanumeric, no special
$ # Assert end of the string
Since . can't appear consecutively, and can't start or end the string, we can consider it a separator between strings of [a-zA-Z0-9_~-]+. So we can write:
[a-zA-Z0-9_~-]++(\.[a-zA-Z0-9_~-]++)*+
All quantifiers are made possessive to reduce stack usage in Oracle's implementation and make the matching faster. Note that it is not appropriate to use them everywhere. Due to the way my regex is written, there is only one way to match a particular string to begin with, even without possessive quantifier.
Shorthand
Since this is Java and in default mode, you can shorten a-zA-Z0-9_ to \w and [a-zA-Z0-9] to [^\W_] (though the second one is a bit hard for other programmer to read):
^(?=[^\W_])[\w~-]++(\.[\w~-]++)*+(?<=[^\W_])$
As Java string literal:
"^(?=[^\\W_])[\\w~-]++(\\.[\\w~-]++)*+(?<=[^\\W_])$"
If you use the regex with String.matches(), the anchors ^ and $ can be removed.
As #MarounMaroun already commented, you don't really have a pattern. It might be better to iterate over the string as in the following method:
public static boolean validate(String string) {
char chars[] = string.toCharArray();
if (!isSpecial(chars[0]) && !isLetterOrDigit(chars[0]))
return false;
if (!isSpecial(chars[chars.length - 1])
&& !isLetterOrDigit(chars[chars.length - 1]))
return false;
for (int i = 1; i < chars.length - 1; ++i)
if (!isPunctiation(chars[i]) && !isLetterOrDigit(chars[i])
&& !isSpecial(chars[i]))
return false;
return true;
}
public static boolean isPunctiation(char c) {
return c == '.' || c == ',';
}
public static boolean isSpecial(char c) {
return c == '-' || c == '_' || c == '~';
}
public static boolean isLetterOrDigit(char c) {
return (Character.isDigit(c) || (Character.isLetter(c) && (Character
.getType(c) == Character.UPPERCASE_LETTER || Character
.getType(c) == Character.LOWERCASE_LETTER)));
}
Test code:
public static void main(String[] args) {
System.out.println(validate("aB78."));
System.out.println(validate("aB78..ab "));
System.out.println(validate("abcdef"));
System.out.println(validate("aB78,1"));
System.out.println(validate("aB78 abc"));
}
Output:
false
false
true
true
false
A solution should try and find negatives rather than try and match a pattern over the entire string.
Pattern bad = Pattern.compile( "[^-\\W.~]|\\.\\.|^\\.|\\.$" );
for( String str: new String[]{ "aB78.", "aB78..ab", "abcdef",
"aB78,1", "aB78 abc" } ){
Matcher mat = bad.matcher( str );
System.out.println( mat.find() );
}
(It is remarkable to see how the initial statement "string...should have only" leads programmers to try and create positive assertions by parsing or matching valid characters over the full length rather than the much simpler search for negatives.)

java regular expression examples for match without length limitation

i trying to write a regular expression for match a string starting with letter "G" and second index should be any number (0-9) and rest of the string can be contain any thing and can be any length,
i'm stuck in following code
String[] array = { "DA4545", "G121", "G8756942", "N45", "4578", "#45565" };
String regExp = "^[G]\\d[0-9]";
for(int i = 0; i < array.length; i++)
{
if(Pattern.matches(regExp, array[i]))
{
System.out.println(array[i] + " - Successful");
}
}
output:
G12 - Successful
why is not match the 3 index "G8756942"
G - the letter G
[0-9] - a digit
.* - any sequence of characters
So the expression
G[0-9].*
will match a letter G followed by a digit followed by any sequence of characters.
when you write \d it already means [0-9]
so when you say \d[0-9] that means two digits exactly
better use :
^G\\d*
which will match all words starting with G and having zero or more digits
"^[G]\\d[0-9]"
This regex matches "G" followed by \\d, then another number.
Use one of these:
"^G\\d"
"^G[0-9]"
Also note that you don't need a character class since it only contains one letter, so it's redundant.
try this regex .* will match any character after digit
^G\\d.*
http://regex101.com/r/uE4tX1/1
why is not match the 3 index "G8756942"
Because you match for a string starting with G, followed by a \, a d and exactly one digit. Solution:
^[G]\d
This regex would be fine.
"G\\d.*"
Because matches method tries to match the whole input, you need to add .* at the last in your pattern and also you don't need to include anchors.
String[] array = { "DA4545", "G121", "G8756942", "N45", "4578", "#45565" };
String regExp = "G\\d.*";
for(int i = 0; i < array.length; i++)
{
if(Pattern.matches(regExp, array[i]))
{
System.out.println(array[i] + " - Successful");
}
}
Output:
G121 - Successful
G8756942 - Successful

split a string in java into equal length substrings while maintaining word boundaries

How to split a string into equal parts of maximum character length while maintaining word boundaries?
Say, for example, if I want to split a string "hello world" into equal substrings of maximum 7 characters it should return me
"hello "
and
"world"
But my current implementation returns
"hello w"
and
"orld "
I am using the following code taken from Split string to equal length substrings in Java to split the input string into equal parts
public static List<String> splitEqually(String text, int size) {
// Give the list the right capacity to start with. You could use an array
// instead if you wanted.
List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);
for (int start = 0; start < text.length(); start += size) {
ret.add(text.substring(start, Math.min(text.length(), start + size)));
}
return ret;
}
Will it be possible to maintain word boundaries while splitting the string into substring?
To be more specific I need the string splitting algorithm to take into account the word boundary provided by spaces and not solely rely on character length while splitting the string although that also needs to be taken into account but more like a max range of characters rather than a hardcoded length of characters.
If I understand your problem correctly then this code should do what you need (but it assumes that maxLenght is equal or greater than longest word)
String data = "Hello there, my name is not importnant right now."
+ " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group(1));
Output:
Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
Short (or not) explanation of "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:
(lets just remember that in Java \ is not only special in regex, but also in String literals, so to use predefined character sets like \d we need to write it as "\\d" because we needed to escape that \ also in string literal)
\G - is anchor representing end of previously founded match, or if there is no match yet (when we just started searching) beginning of string (same as ^ does)
\s* - represents zero or more whitespaces (\s represents whitespace, * "zero-or-more" quantifier)
(.{1,"+maxLenght+"}) - lets split it in more parts (at runtime :maxLenght will hold some numeric value like 10 so regex will see it as .{1,10})
. represents any character (actually by default it may represent any character except line separators like \n or \r, but thanks to Pattern.DOTALL flag it can now represent any character - you may get rid of this method argument if you want to start splitting each sentence separately since its start will be printed in new line anyway)
{1,10} - this is quantifier which lets previously described element appear 1 to 10 times (by default will try to find maximal amout of matching repetitions),
.{1,10} - so based on what we said just now, it simply represents "1 to 10 of any characters"
( ) - parenthesis create groups, structures which allow us to hold specific parts of match (here we added parenthesis after \\s* because we will want to use only part after whitespaces)
(?=\\s|$) - is look-ahead mechanism which will make sure that text matched by .{1,10} will have after it:
space (\\s)
OR (written as |)
end of the string $ after it.
So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it we require that last character matched by .{1,10} is not part of unfinished word (there must be space or end of string after it).
Non-regex solution, just in case someone is more comfortable (?) not using regular expressions:
private String justify(String s, int limit) {
StringBuilder justifiedText = new StringBuilder();
StringBuilder justifiedLine = new StringBuilder();
String[] words = s.split(" ");
for (int i = 0; i < words.length; i++) {
justifiedLine.append(words[i]).append(" ");
if (i+1 == words.length || justifiedLine.length() + words[i+1].length() > limit) {
justifiedLine.deleteCharAt(justifiedLine.length() - 1);
justifiedText.append(justifiedLine.toString()).append(System.lineSeparator());
justifiedLine = new StringBuilder();
}
}
return justifiedText.toString();
}
Test:
String text = "Long sentence with spaces, and punctuation too. And supercalifragilisticexpialidocious words. No carriage returns, tho -- since it would seem weird to count the words in a new line as part of the previous paragraph's length.";
System.out.println(justify(text, 15));
Output:
Long sentence
with spaces,
and punctuation
too. And
supercalifragilisticexpialidocious
words. No
carriage
returns, tho --
since it would
seem weird to
count the words
in a new line
as part of the
previous
paragraph's
length.
It takes into account words that are longer than the set limit, so it doesn't skip them (unlike the regex version which just stops processing when it finds supercalifragilisticexpialidosus).
PS: The comment about all input words being expected to be shorter than the set limit, was made after I came up with this solution ;)

Categories