I need regular expressions to match the below cases.
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
I don't think you can (easily) use regex for the first case. The second case is easy though:
Pattern pattern = Pattern.compile("([a-z\\d])\\1\\1", Pattern.CASE_INSENSITIVE);
Since \\1 represents part matched by group 1 this will match any sequence of three identical characters that are either within the range a-z or are digits (\d).
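As a rough sketch of how that pattern could be used, assuming you only need a yes/no answer for a given input (the class and method names below are made up for illustration):

```java
import java.util.regex.Pattern;

final class RepeatCheck {
    // Same pattern as above: one letter or digit, then two back-references to it.
    private static final Pattern TRIPLE =
            Pattern.compile("([a-z\\d])\\1\\1", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the input contains three identical
    // letters or digits in a row anywhere in the string.
    static boolean hasTripleRepeat(String input) {
        return TRIPLE.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTripleRepeat("pass111word"));
        System.out.println(hasTripleRepeat("a1a1a1"));
    }
}
```

find() is used rather than matches() because the repeated run may sit anywhere inside the input.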
Update
To be clear, you can use regex for the first case. However, the pattern is so laborious and convoluted that you are better off not doing it at all, especially if you want to cover the whole alphabet. In that case you should probably generate the pattern programmatically, by iterating over the char codes of the Unicode charset (or whichever range you care about) and emitting one alternative for every three consecutive characters. Keep in mind that such a large alternation makes matching slower: the work per position grows with the number of alternatives (O(n), where n is the number of groups, i.e. the size of the charset minus 2).
I disagree, case 1 is possible to regex, but you have to tell it the sequences to match... which is kind of long and boring:
/(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+/ig
http://regexr.com/3dqln
for the second question:
\\b([a-zA-Z0-9])\\1\\1+\\b
explanation:
\\b : zero-length word boundary
( : start capture group 1
[a-zA-Z0-9] : a letter or a digit
) : end group
\\1 : same character as group 1
\\1+ : same character as group 1 one or more times
\\b : zero-length word boundary
To my knowledge, the first case is indeed not possible. The regex engine doesn't know anything about the order of the natural numbers or the alphabet. But it's at least possible to differentiate between 3 or more numbers and 3 or more letters, for example:
[a-z]{3,}|[A-Z]{3,}|\d{3,}
This matches abcd, ABCDE or 123 but doesn't match ab2d, A5c4 or 12z, for example. According to this, the second case can be correctly given in a shorter version as:
(\w)\1{2,}
3 or more consecutive sequential characters/numbers ex - 123, abc, 789, pqr etc.
Not possible with regular expressions.
3 or more consecutive identical characters/numbers ex - 111, aaa, bbb. 222 etc.
Use a pattern of (?i)(?:([a-z0-9])\\1{2,})*.
If you want to check the whole string, use Matcher.matches(). To find matches within a string, use Matcher.find().
Here's some sample code:
final String ps = "(?i)(?:([a-z0-9])\\1{2,})*";
final String psLong =
"(?i)\t\t\t# Case insensitive flag\n"
+ "(?:\t\t\t\t# Begin non-capturing group\n"
+ " (\t\t\t\t# Begin capturing group\n"
+ " [a-z0-9]\t\t# Match an alpha or digit character\n"
+ " )\t\t\t\t# End capturing group\n"
+ " \\1\t\t\t\t# Back-reference first capturing group\n"
+ " {2,}\t\t\t# Match previous atom 2 or more times\n"
+ ")\t\t\t\t# End non-capturing group\n"
+ "*\t\t\t\t# Match previous atom zero or more characters\n";
System.out.println("***** PATTERN *****\n" + ps + "\n" + psLong
+ "\n");
final Pattern p = Pattern.compile(ps);
for (final String s : new String[] {"aa", "11", "aaa", "111",
"aaaaaaaaa", "111111111", "aaa111bbb222ccc333",
"aaaaaa111111bbb222"})
{
final Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("Success: " + s);
} else {
System.out.println("Fail: " + s);
}
}
And the output is:
***** PATTERN *****
(?i)(?:([a-z0-9])\1{2,})*
(?i) # Case insensitive flag
(?: # Begin non-capturing group
( # Begin capturing group
[a-z0-9] # Match an alpha or digit character
) # End capturing group
\1 # Back-reference first capturing group
{2,} # Match previous atom 2 or more times
) # End non-capturing group
* # Match previous atom zero or more characters
Fail: aa
Fail: 11
Success: aaa
Success: 111
Success: aaaaaaaaa
Success: 111111111
Success: aaa111bbb222ccc333
Success: aaaaaa111111bbb222
A regex to match three consecutive identical digits or letters is
"([0-9]|[a-zA-Z])\1\1"
Thanks All for helping me.
For the first case (3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.) I used the code below. Please share your comments on it.
public static boolean validateConsecutiveSeq(String epin) {
char epinCharArray[] = epin.toCharArray();
int asciiCode = 0;
boolean isConSeq = false;
int previousAsciiCode = 0;
int numSeqcount = 0;
for (int i = 0; i < epinCharArray.length; i++) {
asciiCode = epinCharArray[i];
if ((previousAsciiCode + 1) == asciiCode) {
numSeqcount++;
if (numSeqcount >= 2) {
isConSeq = true;
break;
}
} else {
numSeqcount = 0;
}
previousAsciiCode = asciiCode;
}
return isConSeq;
}
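One thing the loop above does not check is whether the characters are letters or digits at all, so a run such as "/01" would also count. A possible variant, offered only as a sketch: it assumes ASCII input, takes the minimum run length as a parameter, and uses hypothetical names (SequentialRun, hasSequentialRun) that are not from the original post.

```java
final class SequentialRun {
    // Hypothetical helper: true if s contains at least minLen characters in a row
    // whose codes rise by exactly 1 and which are all digits, all upper-case
    // letters, or all lower-case letters.
    static boolean hasSequentialRun(String s, int minLen) {
        if (minLen < 2) {
            throw new IllegalArgumentException("minLen must be at least 2");
        }
        int run = 1; // length of the ascending run ending at the current character
        for (int i = 1; i < s.length(); i++) {
            char prev = s.charAt(i - 1);
            char cur = s.charAt(i);
            if (cur == prev + 1 && kind(prev) != 0 && kind(prev) == kind(cur)) {
                run++;
                if (run >= minLen) {
                    return true;
                }
            } else {
                run = 1; // sequence broken, start counting again
            }
        }
        return false;
    }

    // 1 = digit, 2 = upper-case letter, 3 = lower-case letter, 0 = anything else.
    private static int kind(char c) {
        if (c >= '0' && c <= '9') return 1;
        if (c >= 'A' && c <= 'Z') return 2;
        if (c >= 'a' && c <= 'z') return 3;
        return 0;
    }
}
```

For example, hasSequentialRun("x789y", 3) reports a run, while hasSequentialRun("89:", 3) does not, because ':' is neither a letter nor a digit.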
If you have lower bound (3) and upper bound regexString can be generated as follows
public class RegexBuilder {
public static void main(String[] args) {
StringBuilder sb = new StringBuilder();
int seqStart = 3;
int seqEnd = 5;
buildRegex(sb, seqStart, seqEnd);
System.out.println(sb);
}
private static void buildRegex(StringBuilder sb, int seqStart, int seqEnd) {
for (int i = seqStart; i <= seqEnd; i++) {
buildRegexCharGroup(sb, i, '0', '9');
buildRegexCharGroup(sb, i, 'A', 'Z');
buildRegexCharGroup(sb, i, 'a', 'z');
buildRegexRepeatedString(sb, i);
}
}
private static void buildRegexCharGroup(StringBuilder sb, int seqLength,
char start, char end) {
for (char c = start; c <= end - seqLength + 1; c++) {
char ch = c;
if (sb.length() > 0) {
sb.append('|');
}
for (int i = 0; i < seqLength; i++) {
sb.append(ch++);
}
}
}
private static void buildRegexRepeatedString(StringBuilder sb, int seqLength) {
sb.append('|');
sb.append("([a-zA-Z\\d])");
for (int i = 1; i < seqLength; i++) {
sb.append("\\1");
}
}
}
Output
012|123|234|345|456|567|678|789|ABC|BCD|CDE|DEF|EFG|FGH|GHI|HIJ|IJK|JKL|KLM|LMN|MNO|NOP|OPQ|PQR|QRS|RST|STU|TUV|UVW|VWX|WXY|XYZ|abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|([a-z\d])\1\1|0123|1234|2345|3456|4567|5678|6789|ABCD|BCDE|CDEF|DEFG|EFGH|FGHI|GHIJ|HIJK|IJKL|JKLM|KLMN|LMNO|MNOP|NOPQ|OPQR|PQRS|QRST|RSTU|STUV|TUVW|UVWX|VWXY|WXYZ|abcd|bcde|cdef|defg|efgh|fghi|ghij|hijk|ijkl|jklm|klmn|lmno|mnop|nopq|opqr|pqrs|qrst|rstu|stuv|tuvw|uvwx|vwxy|wxyz|([a-z\d])\1\1\1|01234|12345|23456|34567|45678|56789|ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP|MNOPQ|NOPQR|OPQRS|PQRST|QRSTU|RSTUV|STUVW|TUVWX|UVWXY|VWXYZ|abcde|bcdef|cdefg|defgh|efghi|fghij|ghijk|hijkl|ijklm|jklmn|klmno|lmnop|mnopq|nopqr|opqrs|pqrst|qrstu|rstuv|stuvw|tuvwx|uvwxy|vwxyz|([a-z\d])\1\1\1\1
All put together:
([a-zA-Z0-9])\1\1+|(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz|012|123|234|345|456|567|678|789)+
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
([a-zA-Z0-9])\1\1+
https://regexr.com/4727n
This also works:
(?:(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9)){2,}\d|(?:a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|v(?=w)|w(?=x)|x(?=y)|y(?=z)){2,}[[:alpha:]])|([a-zA-Z0-9])\1\1+
https://regex101.com/r/6fXC9u/1
For the first question, this works if you're OK with less regex:
function containsConsecutiveCharacters(str) {
    for (let i = 0; i <= str.length - 3; i++) {
        const allThree = str.slice(i, i + 3);
        const s1 = str.charCodeAt(i);
        const s2 = str.charCodeAt(i + 1);
        const s3 = str.charCodeAt(i + 2);
        // all three characters must be letters or digits, and their char codes must rise by exactly 1
        if (/^[a-zA-Z0-9]+$/.test(allThree) && s2 === s1 + 1 && s3 === s2 + 1) {
            return true;
        }
    }
    return false;
}
3 or more consecutive sequential characters/numbers; e.g. 123, abc, 789, pqr, etc.
(?:(?:0(?=1)|1(?=2)|2(?=3)|3(?=4)|4(?=5)|5(?=6)|6(?=7)|7(?=8)|8(?=9)){2,}\d|(?:a(?=b)|b(?=c)|c(?=d)|d(?=e)|e(?=f)|f(?=g)|g(?=h)|h(?=i)|i(?=j)|j(?=k)|k(?=l)|l(?=m)|m(?=n)|n(?=o)|o(?=p)|p(?=q)|q(?=r)|r(?=s)|s(?=t)|t(?=u)|u(?=v)|v(?=w)|w(?=x)|x(?=y)|y(?=z)){2,}[\p{Alpha}])
https://regex101.com/r/5IragF/1
3 or more consecutive identical characters/numbers; e.g. 111, aaa, bbb, 222, etc.
([\p{Alnum}])\1{2,}
https://regex101.com/r/VEHoI9/1
For case #2 I got inspired by a sample on regextester and created the following regex to match n identical digits (to check for both numbers and letters replace 0-9 with A-Za-z0-9):
const n = 3
const identicalAlphanumericRegEx = new RegExp("([0-9])" + "\\1".repeat(n - 1))
I was discussing this with a coworker and we think we have a good solution for #1.
To check for abc or bcd or ... or 012 or 123 or even any number of sequential characters, try:
.*((a(?=b))|(?:b(?=c))|(?:c(?=d))|(?:d(?=e))|(?:e(?=f))|(?:f(?=g))|(?:g(?=h))|(?:h(?=i))|(?:i(?=j))|(?:j(?=k))|(?:k(?=l))|(?:l(?=m))|(?:m(?=n))|(?:n(?=o))|(?:o(?=p))|(?:p(?=q))|(?:q(?=r))|(?:r(?=s))|(?:s(?=t))|(?:t(?=u))|(?:u(?=v))|(?:v(?=w))|(?:w(?=x))|(?:x(?=y))|(?:y(?=z))|(?:0(?=1))|(?:1(?=2))|(?:2(?=3))|(?:3(?=4))|(?:4(?=5))|(?:5(?=6))|(?:6(?=7))|(?:7(?=8))|(?:8(?=9))){2,}.*
The nice thing about this solution is if you want more than 3 consecutive characters, increase the {2,} to be one less than what you want to check for.
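The ?: at the start of a group prevents that group from being captured. As written, the outer group and the first alternative, (a(?=b)), are still capturing groups.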
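Typing that alternation by hand is error-prone, so it could be generated instead. The following is only a sketch of the same lookahead idea: it covers lower-case a-z and 0-9, drops the surrounding .* because find() is used, and the names LookaheadSequence, build and appendRange are invented for this example.

```java
import java.util.regex.Pattern;

final class LookaheadSequence {
    // Builds (?:a(?=b)|b(?=c)|...|8(?=9)){minLen-1,}
    // Each alternative consumes one character and peeks at its successor,
    // so minLen - 1 repetitions imply minLen sequential characters.
    static Pattern build(int minLen) {
        if (minLen < 2) {
            throw new IllegalArgumentException("minLen must be at least 2");
        }
        StringBuilder alternatives = new StringBuilder();
        appendRange(alternatives, 'a', 'z');
        appendRange(alternatives, '0', '9');
        return Pattern.compile("(?:" + alternatives + "){" + (minLen - 1) + ",}");
    }

    // Appends c(?=next) for every character in [from, to) to the alternation.
    private static void appendRange(StringBuilder sb, char from, char to) {
        for (char c = from; c < to; c++) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(c).append("(?=").append((char) (c + 1)).append(')');
        }
    }

    public static void main(String[] args) {
        System.out.println(build(3).matcher("xx123yy").find());
        System.out.println(build(4).matcher("abc").find());
    }
}
```

Note that the text returned by group() is one character shorter than the sequence itself, because the last character is only looked at, not consumed.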
Try this for the first question.
It returns true if the argument contains 3 characters whose char codes ascend or descend by exactly 1 (e.g. abc or 321). Note that it does not restrict the characters to letters and digits.
function check(val) {
    for (let i = 0; i <= val.length - 3; i++) {
        const s1 = val.charCodeAt(i);
        const s2 = val.charCodeAt(i + 1);
        const s3 = val.charCodeAt(i + 2);
        // same step of +1 or -1 between both neighbouring pairs
        if (Math.abs(s1 - s2) === 1 && s1 - s2 === s2 - s3) {
            return true;
        }
    }
    return false;
}
console.log(check('Sh1ak#ki1r#100'));
Related
I need a regex that removes leading zeros.
Right now I am using this code. It works, but it doesn't keep the negative sign.
String regex = "^0+(?!$)";
String numbers = txaTexte.getText().replaceAll(regex, "");
after that I split numbers so it puts the numbers in a array.
input :
-0005
0003
-87
output :
-5
3
-87
I was also wondering what regex I could use to get this.
the words before the arrow are input and after is the output
the text is in french. And right now I am using this it works but not with the apostrophe.
String [] tab = txaTexte.getText().split("(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+")
Un beau JOUR. —> Un/beau/JOUR
La boîte crânienne —> La/boîte/crânienne
C’était mieux aujourd’hui —> C’/était/mieux/aujourd’hui
qu’autrefois —> qu’/autrefois
D’hier jusqu’à demain! —> D’/hier/jusqu’/à/demain
Dans mon sous-sol—> Dans/mon/sous-sol
You can capture an optional hyphen in group 1, then match one or more zeros, and capture 1 or more digits starting with a digit 1-9 in group 2.
^(-?)0+([1-9]\d*)$
^ Start of string
(-?) Capture group 1, match optional hyphen
0+ Match 1+ zeroes
([1-9]\d*) Capture group 2, match 1+ digits starting with a digit 1-9
$ End of string
See a regex demo.
In the replacement use group 1 and group 2.
String regex = "^(-?)0+([1-9]\\d*)$";
String text = "-0005";
String numbers = text.replaceAll(regex, "$1$2"); // "-5"
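Because ^ and $ anchor on the whole input by default, a text area holding one number per line would need the MULTILINE flag for the same pattern to apply to every line. A minimal sketch under that assumption (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

final class LeadingZeros {
    // Same pattern as above; MULTILINE makes ^ and $ match at every line break.
    private static final Pattern LEADING_ZEROS =
            Pattern.compile("^(-?)0+([1-9]\\d*)$", Pattern.MULTILINE);

    // Hypothetical helper: strips leading zeros on each line, keeping a minus sign.
    static String strip(String text) {
        return LEADING_ZEROS.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(strip("-0005\n0003\n-87"));
    }
}
```

Lines that do not fit the pattern, such as -87 or a line consisting only of zeros, are left unchanged.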
Here is one way. This preserves the sign.
capture the optional sign.
check for 0 or more leading zeros
followed by 1 or more digits.
String regex = "^([+-])?0*(\\d+)";
String [] data = {"-1415", "+2924", "-0000123", "000322", "+000023"};
for (String num : data) {
String after = num.replaceAll(regex, "$1$2");
System.out.printf("%8s --> %s%n", num , after);
}
prints
-1415 --> -1415
+2924 --> +2924
-0000123 --> -123
000322 --> 322
+000023 --> +23
If you want to keep -000, 000, +0000 etc. as just 0, try this regex:
`^[-+]?0*(0)$|^([-+])?0*(\d+)$`
Break down:
^...$ means the entire string should match (^ is the start of the string, $ is the end)
...|... is an alternative
[-+] is a character class that contains only the plus and minus characters. Note that - has a special meaning ("range") in character classes if it's not the first or last character
(...) is a capturing group which can be referenced in the replacement string by $number where number is the 1-based and 1-digit position of the group within the regex (the first group to start is no. 1 etc.)
?, * and + are quantifiers when used outside character classes meaning "0 or 1 occurrence" (?), "any number of occurrences, including none" (*) and "at least one occurrence" (+)
^[-+]?0*(0)$ thus means: the entire string must be an optional sign, followed by any number of zeros and ending with a single zero which is captured as group 1.
alternatively ^([-+])?0*(\d+)$ means the entire string must be an optional sign which is captured as group 2, followed by any number of zeros and ending in at least one digit which is captured as group 3.
This regex can then be used with String.replaceAll(regex, "$1$2$3") in order to keep only the single 0 from group 1 or the optional sign and the number without leading zeros from groups 2 and 3. Groups that did not take part in the match are replaced by empty strings, which is why this works.
However, regular expressions can be slow, especially if you have to process a lot of strings.
One thing to improve this would be to compile the pattern only once:
//compile the pattern once and reuse it
Pattern p = Pattern.compile("^[+-]?0*(0)$|^([+-])?0*(\\d+)$");
//build a matcher from the pattern and the input string, and do the replacement
String number = p.matcher(txaTexte.getText()).replaceAll("$1$2$3");
If you're working on a large number of strings (> 10000) you might want to use some specialized plain parsing without regex. Consider something like this, which on my machine is about 10x faster than the regex approach with reused pattern:
public static String stripLeadingZeros(String s) {
//nothing to do, return the string as is
if( s == null || s.isEmpty() ) {
return s;
}
char[] chars = s.toCharArray();
int usedChars = 0;
//check if the first character is the sign
boolean hasSign = false;
if(chars[0] == '-' || chars[0] == '+') {
hasSign = true;
usedChars++;
//special case: just a sign
if(chars.length == 1) {
return s;
}
}
//process the rest of the characters
boolean stripZeros = true;
for( int i = usedChars; i < chars.length; i++) {
//not a digit, this isn't a simple integer, stop processing and keep the original string
if( chars[i] < '0' || chars[i] > '9') {
return s;
}
//are we still in zero-stripping mode
if( stripZeros) {
if( chars[i] == '0') {
continue; //check next char
}
//we've found a non-zero char, keep it and end zero-stripping mode
if(chars[i] >= '1' && chars[i] <= '9') {
stripZeros = false;
}
}
//since we are ignoring leading zeros, we just move all digits of the actual number to the left
chars[usedChars++] = chars[i];
}
//handle special case of number 0 (with optional sign)
if( usedChars == (hasSign ? 1 : 0)) {
chars[0] = '0';
usedChars = 1;
}
return new String(chars,0, usedChars);
}
I need to do a check for a 5-digit number using regex. Check condition: 1. There should be no more than 2 repeating digits (type 11234). 2. There should be no sequence 12345 or 54321.
I am trying to do this:
var PASSWORD_PATTERN = "^(?=[\\\\D]*\\\\d)(?!.*(\\\\d)\\\\1)(?!.*\\\\2{3,}){5,}.*$",
But checking for 12345 or 54321 doesn't work.
You can assert for not 3 of the same digits, and assert not 12345 and 54321.
Note to double escape the backslash in Java \\d.
^(?!\d*(\d)\d*\1\d*\1)(?!12345)(?!54321)\d{5}$
The pattern matches:
^ Start of string
(?!\d*(\d)\d*\1\d*\1) Negative lookahead, do not match 3 times the same digits using 2 backreferences \1
(?!12345) Assert not 12345
(?!54321) Assert not 54321
\d{5} Match 5 digits
$ End of string
Regex demo
Or fail the match immediately if the string does not consist of 5 digits, and match 1+ digits if all the assertions succeed.
^(?=\d{5}$)(?!\d*(\d)\d*\1\d*\1)(?!12345)(?!54321)\d+$
Regex demo
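A small sketch of how the first pattern might be wrapped in Java, with the backslashes doubled as noted above (PinCheck and isValid are names chosen for this example only):

```java
import java.util.regex.Pattern;

final class PinCheck {
    // No digit may occur three times, and the exact sequences 12345 and 54321 are rejected.
    private static final Pattern PIN =
            Pattern.compile("^(?!\\d*(\\d)\\d*\\1\\d*\\1)(?!12345)(?!54321)\\d{5}$");

    // Hypothetical helper: true if pin is a 5-digit string that passes both checks.
    static boolean isValid(String pin) {
        return PIN.matcher(pin).matches();
    }

    public static void main(String[] args) {
        for (String pin : new String[] {"11234", "11123", "12345", "54321", "13579"}) {
            System.out.println(pin + " -> " + isValid(pin));
        }
    }
}
```

Note that the first lookahead rejects a digit that appears three times anywhere in the string, not only three times in a row.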
If you don't want to match ascending and descending sequences for digits 0-9, you might either manually check the string for each hardcoded sequence, or generate the sequences and add them to a list.
Then you can check if the sequence of 5 digits is in the list, and remove the exact check with the lookarounds from the pattern.
List<String> sequences = new ArrayList<>();
for (int i = 0; i < 10; i++) {
StringBuilder sequence = new StringBuilder();
int last = i;
for (int j = 0; j < 5; j++) {
++last;
if (last > 9) last = 0;
sequence.append(last);
}
sequences.add(sequence.toString());
sequences.add(sequence.reverse().toString());
}
String[] strings = {"12345", "54321", "34567", "90123", "112341", "12356", "00132"};
for (String s : strings) {
if ((!sequences.contains(s)) && s.matches("^(?=\\d{5}$)(?!\\d*(\\d)\\d*\\1\\d*\\1)\\d+$")) {
System.out.printf("%s is not a sequence and does not contain 3 of the same digits\n", s);
}
}
Output
12356 is not a sequence and does not contain 3 of the same digits
00132 is not a sequence and does not contain 3 of the same digits
Java demo
I have a single string that contains several quotes, i.e:
"Bruce Wayne" "43" "male" "Gotham"
I want to create a method using regex that extracts certain values from the String based on their position.
So for example, if I pass the Int values 1 and 3 it should return a String of:
"Bruce Wayne" "male"
Please note the double quotes are part of the String and are escaped characters (\")
If the number of (possible) groups is known you could use a regular expression like "(.*?)"\s*"(.*?)"\s*"(.*?)"\s*"(.*?)" along with Pattern and Matcher and access the groups by number (group 0 is always the entire match, group 1 is the first capturing group in the expression, and so on).
If the number of groups is not known you could just use the expression "(.*?)" and use Matcher#find() to apply the expression in a loop and collect all the matches (group 0 in that case) into a list. Then use your indices to access the list element (element 1 would be at index 0 then).
Another alternative would be to use string.replaceAll("^[^\"]*\"|\"[^\"]*$","").split("\"\\s*\""), i.e. remove the leading and trailing double quotes with any text before or after and then split on quotes with optional whitespace in between.
Example:
assume the string optional crap before "Bruce Wayne" "43" "male" "Gotham" optional crap after
string.replaceAll("^[^\"]*\"|\"[^\"]*$","") will result in Bruce Wayne" "43" "male" "Gotham
applying split("\"\\s*\"") on the result of the step before will yield the array [Bruce Wayne, 43, male, Gotham]
then just access the array elements by index (zero-based)
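The steps above could be put together roughly as follows. This is a sketch, not a hardened parser: it assumes the values themselves contain no double quotes, uses 1-based positions as in the question, and the names QuotedFields, fields and pick are hypothetical.

```java
final class QuotedFields {
    // Strips everything up to the first quote and from the last quote onwards,
    // then splits on quote, optional whitespace, quote.
    static String[] fields(String s) {
        return s.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"\\s*\"");
    }

    // Returns the requested fields (1-based positions), re-quoted and joined by a space.
    static String pick(String s, int... positions) {
        String[] all = fields(s);
        StringBuilder out = new StringBuilder();
        for (int p : positions) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append('"').append(all[p - 1]).append('"');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "\"Bruce Wayne\" \"43\" \"male\" \"Gotham\"";
        System.out.println(pick(input, 1, 3));
    }
}
```

A position outside the number of fields throws ArrayIndexOutOfBoundsException; whether to validate that is left open here.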
My function starts at 0. You said that you want 1 and 3, but array indices usually start at 0, so to get "Bruce Wayne" you'd ask for 0, not 1 (you could change that if you'd like). Note that the positions must be passed in ascending order, because the matches are scanned only once.
String[] getParts(String text, int... positions) {
String results[] = new String[positions.length];
Matcher m = Pattern.compile("\"[^\"]*\"").matcher(text);
for(int i = 0, j = 0; m.find() && j < positions.length; i++) {
if(i != positions[j]) continue;
results[j] = m.group();
j++;
}
return results;
}
// Usage
public Test() {
String[] parts = getParts(" \"Bruce Wayne\" \"43\" \"male\" \"Gotham\" ", 0, 2);
System.out.println(Arrays.toString(parts));
// = ["Bruce Wayne", "male"]
}
The method accepts as many parameters as you like.
getParts(" \"a\" \"b\" \"c\" \"d\" ", 0, 2, 3); // = a, c, d
// or
getParts(" \"a\" \"b\" \"c\" \"d\" ", 3); // = d
The function to extract words based on position:
import java.util.ArrayList;
import java.util.regex.*;
public String getString(String input, int i, int j){
ArrayList <String> list = new ArrayList <String> ();
Matcher m = Pattern.compile("(\"[^\"]+\")").matcher(input);
while (m.find()) {
list.add(m.group(1));
}
return list.get(i - 1) + list.get(j - 1);
}
Then the words can be extracted like:
String input = "\"Bruce Wayne\" \"43\" \"male\" \"Gotham\"";
String res = getString(input, 1, 3);
System.out.println(res);
Output:
"Bruce Wayne""male"
I'm searching for an efficient way for a wildcard-enabled search in Java. My first approach was of course to use regex. However this approach does NOT find ALL possible matches!
Here's the code:
public static ArrayList<StringOccurrence> matchesWildcard(String string, String pattern, boolean printToConsole) {
Pattern p = Pattern.compile(normalizeWildcards(pattern));
Matcher m = p.matcher(string);
ArrayList<StringOccurrence> res = new ArrayList<StringOccurrence>();
int count = 0;
while (m.find()){
res.add(new StringOccurrence(m.start(), m.end(), count, m.group()));
if(printToConsole)
System.out.println(count + ") " + m.group() + ", " + m.start() + ", " + m.end());
count +=1;
}
return res;
}
For a query q: ab*b and a String str: abbccabbccbbb I get the output:
0) abb, 0, 3
1) abb, 5, 8
But the whole String should be also a result, because it matches the pattern. It seems that the Java-implementation of regex starts each new search after the last match...
Any ideas how this could work (or suggestions for frameworks...)?
If you really need all possible matches, this answer is not useful for you (though other users may find it useful).
If the widest match would be sufficient for you, then use a greedy quantifier (I guess you're using a reluctant one, showing your pattern would be useful).
Google for greedy vs reluctant quantifiers for regex.
Cheers.
ab*b means "a" followed by zero or more "b" followed by a "b". The minimum match would be "ab". Sounds like you're looking for something like: a[a-z]*b where [a-z]* indicates zero or more of any lowercase letter. You may also want to bound it so that the start of the "word" must be an "a" and the end must be a "b": \ba[a-z]*b\b
You are expecting * to mean .* and .*? at the same time (and more).
You should reconsider what you really need. Let's extend your example:
abbccabbccbbbcabb
Do you really want all possibilities?
To achieve what you want you'll have to
iterate p1 over all occurrences of "ab"
from p1+2 on
iterate p2 over all occurrences of "b"
output substring between p1 and p2+1
This is the corresponding Java code:
public static void main( String[] args ){
String s = "abbccabbccbbb";
int f1 = 0;
int p1;
while( (p1 = s.indexOf( "ab", f1 )) >= 0 ){
int f2 = p1 + 2;
int p2;
while( (p2 = s.indexOf( "b", f2 )) >= 0 ){
System.out.println( s.substring( p1, p2 + 1 ) + " " + p1 + ":" + (p2 + 1) );
f2 = p2 + 1;
}
f1 = p1 + 2;
}
}
Below is the output. You may be surprised - maybe that's more than you expect, but then you'll need to refine your specification.
abb 0:3
abbccab 0:7
abbccabb 0:8
abbccabbccb 0:11
abbccabbccbb 0:12
abbccabbccbbb 0:13
abb 5:8
abbccb 5:11
abbccbb 5:12
abbccbbb 5:13
Later
Why is a single regular expression not capable of doing it?
The basic mechanism of pattern matching is to try and match the regex against a string, starting at some position, initially 0. If a match is found, this position is advanced according to the matched string. The pattern matcher never looks back.
A pattern ab.*?b will try and find the next 'b' after an "ab". This means that no match is possible beginning with the same "ab" and ending at some 'b' following that previously found "next 'b'".
In other words: one regex cannot find overlapping substrings.
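If every overlapping match really is required, one brute-force option is to keep the regex but test it against every (start, end) window using Matcher.region and Matcher.matches. The sketch below assumes the wildcard has already been translated to ab.*b; the names are hypothetical, and the cost grows at least quadratically with the input length, so it only suits short strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AllMatches {
    // Hypothetical helper: lists every substring that matches p in full,
    // formatted as "text start:end".
    static List<String> allMatches(String s, Pattern p) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(s);
        for (int start = 0; start < s.length(); start++) {
            for (int end = start + 1; end <= s.length(); end++) {
                m.region(start, end); // restrict the matcher to this window
                if (m.matches()) {    // the whole window must match the pattern
                    out.add(s.substring(start, end) + " " + start + ":" + end);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        allMatches("abbccabbccbbb", Pattern.compile("ab.*b")).forEach(System.out::println);
    }
}
```

For the example string this enumerates the same ten windows as the indexOf loop above.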
I have a String variable (basically an English sentence with an unspecified number of numbers) and I'd like to extract all the numbers into an array of integers. I was wondering whether there was a quick solution with regular expressions?
I used Sean's solution and changed it slightly:
LinkedList<String> numbers = new LinkedList<String>();
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(line);
while (m.find()) {
numbers.add(m.group());
}
Pattern p = Pattern.compile("-?\\d+");
Matcher m = p.matcher("There are more than -2 and less than 12 numbers here");
while (m.find()) {
System.out.println(m.group());
}
... prints -2 and 12.
-? matches a leading negative sign -- optionally. \d matches a digit, and we need to write \ as \\ in a Java String though. So, \d+ matches 1 or more digits.
What about using the replaceAll method of java.lang.String:
String str = "qwerty-1qwerty-2 455 f0gfg 4";
str = str.replaceAll("[^-?0-9]+", " ");
System.out.println(Arrays.asList(str.trim().split(" ")));
Output:
[-1, -2, 455, 0, 4]
Description
[^-?0-9]+
[ and ] delimit a set of characters, any one of which matches a single character
^ Special identifier used at the beginning of the set to negate it: it matches any character not present in the delimited set, instead of the characters present in the set.
+ Between one and unlimited times, as many times as possible, giving back as needed
-? One of the characters “-” and “?”
0-9 A character in the range between “0” and “9”
Pattern p = Pattern.compile("[0-9]+");
Matcher m = p.matcher(myString);
while (m.find()) {
int n = Integer.parseInt(m.group());
// append n to list
}
// convert list to array, etc
You can actually replace [0-9] with \d, but that involves double backslash escaping, which makes it harder to read.
StringBuffer sBuffer = new StringBuffer();
Pattern p = Pattern.compile("[0-9]+\\.[0-9]*|[0-9]*\\.[0-9]+|[0-9]+");
Matcher m = p.matcher(str);
while (m.find()) {
sBuffer.append(m.group());
}
return sBuffer.toString();
This is for extracting numbers retaining the decimal
The accepted answer detects digits but does not detect formatted numbers, e.g. 2,000, nor decimals, e.g. 4.8. For those, use -?\\d+(,\\d+)*(\\.\\d+)?:
Pattern p = Pattern.compile("-?\\d+(,\\d+)*(\\.\\d+)?");
List<String> numbers = new ArrayList<String>();
Matcher m = p.matcher("Government has distributed 4.8 million textbooks to 2,000 schools");
while (m.find()) {
numbers.add(m.group());
}
System.out.println(numbers);
Output:
[4.8, 2,000]
Using Java 8, you can do:
String str = "There 0 are 1 some -2-34 -numbers 567 here 890 .";
int[] ints = Arrays.stream(str.replaceAll("-", " -").split("[^-\\d]+"))
.filter(s -> !s.matches("-?"))
.mapToInt(Integer::parseInt).toArray();
System.out.println(Arrays.toString(ints)); // prints [0, 1, -2, -34, 567, 890]
If you don't have negative numbers, you can get rid of the replaceAll (and use !s.isEmpty() in filter), as that's only to properly split something like 2-34 (this can also be handled purely with regex in split, but it's fairly complicated).
Arrays.stream turns our String[] into a Stream<String>.
filter gets rid of the leading and trailing empty strings as well as any - that isn't part of a number.
mapToInt(Integer::parseInt).toArray() calls parseInt on each String to give us an int[].
Alternatively, Java 9 has a Matcher.results method, which should allow for something like:
Pattern p = Pattern.compile("-?\\d+");
Matcher m = p.matcher("There 0 are 1 some -2-34 -numbers 567 here 890 .");
int[] ints = m.results().map(MatchResult::group).mapToInt(Integer::parseInt).toArray();
System.out.println(Arrays.toString(ints)); // prints [0, 1, -2, -34, 567, 890]
As it stands, neither of these is a big improvement over just looping over the results with Pattern / Matcher as shown in the other answers, but it should be simpler if you want to follow this up with more complex operations which are significantly simplified with the use of streams.
For decimal numbers use this one (the dot must be escaped, otherwise it matches any character): (([0-9]+\.[0-9]*)|([0-9]*\.[0-9]+)|([0-9]+))
Extract all real numbers using this.
public static ArrayList<Double> extractNumbersInOrder(String str){
str+='a';
double[] returnArray = new double[]{};
ArrayList<Double> list = new ArrayList<Double>();
String singleNum="";
Boolean numStarted;
for(char c:str.toCharArray()){
if(isNumber(c)){
singleNum+=c;
} else {
if(!singleNum.equals("")){ //number ended
list.add(Double.valueOf(singleNum));
System.out.println(singleNum);
singleNum="";
}
}
}
return list;
}
public static boolean isNumber(char c){
return Character.isDigit(c) || c == '-' || c == '+' || c == '.';
}
Fraction and grouping characters for representing real numbers may differ between languages. The same real number could be written in very different ways depending on the language.
The number two million in English
2,000,000.00
and in German
2.000.000,00
A method to fully extract real numbers from a given string in a language agnostic way:
public List<BigDecimal> extractDecimals(final String s, final char fraction, final char grouping) {
List<BigDecimal> decimals = new ArrayList<BigDecimal>();
//Remove grouping character for easier regexp extraction
StringBuilder noGrouping = new StringBuilder();
int i = 0;
while(i >= 0 && i < s.length()) {
char c = s.charAt(i);
if(c == grouping) {
int prev = i-1, next = i+1;
boolean isValidGroupingChar =
prev >= 0 && Character.isDigit(s.charAt(prev)) &&
next < s.length() && Character.isDigit(s.charAt(next));
if(!isValidGroupingChar)
noGrouping.append(c);
i++;
} else {
noGrouping.append(c);
i++;
}
}
//quote the fraction character so that characters such as '.' are taken literally in the regular expression
String fractionRegex = Pattern.quote(String.valueOf(fraction));
Pattern p = Pattern.compile("-?(\\d+" + fractionRegex + "\\d+|\\d+)");
Matcher m = p.matcher(noGrouping);
while (m.find()) {
//BigDecimal expects '.' as the fraction character
String match = m.group().replace(fraction, '.');
decimals.add(new BigDecimal(match));
}
return decimals;
}
If you want to exclude numbers that are contained within words, such as bar1 or aa1bb, then add word boundaries \b to any of the regex based answers. For example:
Pattern p = Pattern.compile("\\b-?\\d+\\b");
Matcher m = p.matcher("9There 9are more9 th9an -2 and less than 12 numbers here9");
while (m.find()) {
System.out.println(m.group());
}
displays:
2
12
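As the output shows, the leading \b also drops the minus sign of -2, because there is no word boundary between a space and a hyphen. If the sign matters, one possible variation is to replace the leading \b with a negative lookbehind for a word character. This is a sketch with hypothetical names, not part of the original answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SignedWholeNumbers {
    // (?<!\w) : the number (including an optional minus) must not directly follow a word character
    // \b      : and must not be directly followed by one either
    private static final Pattern NUMBER = Pattern.compile("(?<!\\w)-?\\d+\\b");

    // Hypothetical helper: collects standalone integers, keeping a leading minus sign.
    static List<Integer> extract(String text) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extract("9There 9are more9 th9an -2 and less than 12 numbers here9"));
    }
}
```

Integer.parseInt throws NumberFormatException for values outside the int range, so very long digit runs would need Long or BigInteger instead.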
I would suggest to check the ASCII values to extract numbers from a String
Suppose you have an input String as myname12345 and if you want to just extract the numbers 12345 you can do so by first converting the String to Character Array then use the following pseudocode
char[] a = "myname12345".toCharArray();
for (int i = 0; i < a.length; i++)
{
    if (a[i] >= 48 && a[i] <= 57) // ASCII codes of '0' through '9'
        System.out.print(a[i]);
}
once the numbers are extracted append them to an array
Hope this helps
I found this expression simplest
String[] extractednums = msg.split("\\D+");
public static String extractNumberFromString(String number) {
String num = number.replaceAll("[^0-9]+", " ");
return num.replaceAll(" ", "");
}
extracts only numbers from string