Regex to replace a repeating string pattern

I need to replace a repeated pattern within a word with its basic repeating unit. For example,
I have the string "TATATATA" and I want to replace it with "TA". I would probably only replace runs of more than 2 repetitions, to avoid changing normal words.
I am trying to do this in Java with the replaceAll method.

I think you want this (works for any length of the repeated string):
String result = source.replaceAll("(.+)\\1+", "$1");
Or alternatively, to prioritize shorter matches:
String result = source.replaceAll("(.+?)\\1+", "$1");
It first matches a group of characters and then the same group again, one or more times (using a back-reference within the pattern itself). I tried it and it seems to do the trick.
Example
String source = "HEY HEY duuuuuuude what'''s up? Trololololo yeye .0.0.0";
System.out.println(source.replaceAll("(.+?)\\1+", "$1"));
// HEY dude what's up? Trolo ye .0
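The question also asks to collapse only runs of more than two repetitions. A minimal sketch of one way to do that, assuming "more than two" means the unit must occur at least three times in a row; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

class RepeatCollapser {
    // \1{2,} demands at least two further copies of the captured unit,
    // i.e. three or more consecutive occurrences in total (an assumption
    // about what the asker wants).
    private static final Pattern RUN = Pattern.compile("(.+?)\\1{2,}");

    static String collapse(String source) {
        return RUN.matcher(source).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("TATATATA")); // TA
        System.out.println(collapse("papa"));     // papa (only two occurrences)
    }
}
```

With this quantifier, ordinary words that merely contain a doubled unit, such as "papa", are left alone.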

You would be better off using a precompiled Pattern here rather than String.replaceAll(), which recompiles the regex on every call. For instance:
private static final Pattern PATTERN
= Pattern.compile("\\b([A-Z]{2,}?)\\1+\\b");
//...
final Matcher m = PATTERN.matcher(input);
final String ret = m.replaceAll("$1");
Edit: example:
public static void main(final String... args)
{
System.out.println("TATATA GHRGHRGHRGHR"
.replaceAll("\\b([A-Za-z]{2,}?)\\1+\\b", "$1"));
}
This prints:
TA GHR

Since you asked for a regex solution:
(\\w)(\\w)(\\1\\2){2,}
(\w)(\w) matches every pair of consecutive word characters ((.)(.) would catch every consecutive pair of characters of any type), storing them in capturing groups 1 and 2. (\\1\\2) matches when the characters in those groups are repeated immediately afterward, and {2,} requires that repetition to occur two or more times, so the pair must appear at least three times in total ({2,10} would require between two and ten repetitions, inclusive).
String s = "hello TATATATA world";
Pattern p = Pattern.compile("(\\w)(\\w)(\\1\\2){2,}");
Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group());
//prints "TATATATA"
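To turn the match above into a replacement, one option is to substitute the two captured characters back in. A small sketch, with class and method names chosen only for illustration:

```java
import java.util.regex.Pattern;

class PairCollapser {
    private static final Pattern TRIPLED_PAIR =
            Pattern.compile("(\\w)(\\w)(\\1\\2){2,}");

    static String collapse(String s) {
        // $1$2 rebuilds the two-character unit from groups 1 and 2.
        return TRIPLED_PAIR.matcher(s).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello TATATATA world")); // hello TA world
    }
}
```

Note that this pattern only handles units that are exactly two word characters long; a unit such as GHR would need the variable-length group from the other answers.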

Related

Find ALL matches of a regex pattern in Java - even overlapping ones [duplicate]

I have a String of the form:
1,2,3,4,5,6,7,8,...
I am trying to find all substrings in this string that contain exactly 4 digits. For this I have the regex [0-9],[0-9],[0-9],[0-9]. Unfortunately when I try to match the regex against my String, I never obtain all the substrings, only a part of all the possible substrings. For instance, in the example above I would only get:
1,2,3,4
5,6,7,8
although I expect to get:
1,2,3,4
2,3,4,5
3,4,5,6
...
How would I go about finding all matches corresponding to my regex?
For info, I am using Pattern and Matcher to find the matches:
Pattern pattern = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
Matcher matcher = pattern.matcher(myString);
List<String> matches = new ArrayList<String>();
while (matcher.find())
{
matches.add(matcher.group());
}

By default, successive calls to Matcher.find() start at the end of the previous match.
To search from a specific location, pass a start index to find(int); here, that index is one character past the start of the previous match.
In your case, after a first successful find(), probably something like:
while (matcher.find(matcher.start()+1))
This works fine:
Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
public void test(String[] args) throws Exception {
String test = "0,1,2,3,4,5,6,7,8,9";
Matcher m = p.matcher(test);
if(m.find()) {
do {
System.out.println(m.group());
} while(m.find(m.start()+1));
}
}
printing
0,1,2,3
1,2,3,4
...
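The same idea can be wrapped in a reusable helper. This is only a sketch; the class name OverlapFinder and the method findAll are hypothetical, not part of java.util.regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapFinder {
    // Collects every match, restarting the search one character after
    // the start of the previous match so that overlaps are found too.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        int from = 0;
        // find(int) throws if the index is past the end of the input,
        // hence the explicit bound check.
        while (from <= input.length() && m.find(from)) {
            out.add(m.group());
            from = m.start() + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
        System.out.println(findAll(p, "0,1,2,3,4,5"));
    }
}
```

Keeping the start index in a local variable avoids calling start() before any match has been found, which would throw an IllegalStateException.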

If you are looking for a pure regex based solution then you may use this lookahead based regex for overlapping matches:
(?=((?:[0-9],){3}[0-9]))
Note that your matches are available in capturing group #1.
Code:
final String regex = "(?=((?:[0-9],){3}[0-9]))";
final String string = "0,1,2,3,4,5,6,7,8,9";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
output:
0,1,2,3
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
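If the elements can be longer than one digit, the lookahead can be adapted. The following is a sketch under that assumption: the negative lookbehind (my addition, not part of the answer above) keeps a window from starting in the middle of a number, and the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiDigitWindows {
    // (?<![0-9]) : do not start inside a number
    // (?=(...))  : capture four comma-separated numbers without consuming them
    private static final Pattern WINDOW =
            Pattern.compile("(?<![0-9])(?=((?:[0-9]+,){3}[0-9]+))");

    static List<String> windows(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WINDOW.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(windows("10,20,30,40,50"));
    }
}
```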

Here is some sample code without regex (which does not seem useful to me here; I would also expect regex to be slower in this case). As written, it only works as long as each element is exactly 1 character long.
String s = "a,b,c,d,e,f,g,h";
for (int i = 0; i + 7 <= s.length(); i += 2) {
System.out.println(s.substring(i, i + 7));
}
Output for this string:
a,b,c,d
b,c,d,e
c,d,e,f
d,e,f,g
e,f,g,h
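The loop above relies on every element being one character wide. A possible variant that drops this restriction is to split on the commas first and then join sliding windows of elements. This is a sketch only; the names SplitWindows and windows are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitWindows {
    // Returns every run of `size` consecutive comma-separated elements.
    static List<String> windows(String csv, int size) {
        String[] parts = csv.split(",");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= parts.length; i++) {
            out.add(String.join(",", Arrays.copyOfRange(parts, i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(windows("10,2,300,4,5", 4));
    }
}
```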

As @OldCurmudgeon pointed out, find() by default starts looking from the end of the previous match. To position it right after the first matched element, make that first element a capturing group and use its end index:
Pattern pattern = Pattern.compile("(\\d,)\\d,\\d,\\d");
Matcher matcher = pattern.matcher("1,2,3,4,5,6,7,8,9");
List<String> matches = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
start = matcher.end(1);
matches.add(matcher.group());
}
System.out.println(matches);
results in
[1,2,3,4, 2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9]
This approach would also work if each element is longer than one digit (with \\d+ in place of \\d).

Regex look ahead to separate string into tokens

I currently have the following code which allows me to find matches from a String.
I need to be able to find all words similar to 64x and split them up into tokens, so I'll get 64 and x as the output.
I have looked at regex lookahead, but it does not solve the issue. Is there a way to do this without creating a new ArrayList to store matches similar to 64x and then splitting them up?
String input = "Hello world 65x";
ArrayList<String> userInput = new ArrayList<>();
Matcher isMatch = Pattern.compile("[0-9]*+[a-zA-Z]")
.matcher(input);
while (isMatch.find()) {
userInput.add(isMatch.group());
}

You can try the following regular expression:
\b(\p{Digit}+)(\p{Alpha})\b
Additionally, if you plan to use the regular expression very often, it is recommended to store the compiled pattern in a constant to avoid recompiling it each time, e.g.:
private static final Pattern REGEX_PATTERN =
Pattern.compile("\\b(\\p{Digit}+)(\\p{Alpha})\\b");
public static void main(String[] args) {
String input = "Hello world 65x";
Matcher matcher = REGEX_PATTERN.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
}
Output:
65
x

There is no need for lookaheads; you can use capturing groups:
Matcher isMatch = Pattern.compile("\\b([0-9]+)([a-zA-Z])\\b").matcher(input);
Group #1 will contain 65 and group #2 will contain x.
It is better to add \\b (word boundary) on either side to avoid matching inside abc56xyz.

You just need to use Matcher.group(int). This lets you extract pieces of the matched text; see the documentation on capturing groups in java.util.regex.Pattern. A regex that contains capturing groups is \\b([0-9]+)([a-zA-Z])\\b (as given by anubhava).
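Put together, the groups can be added straight to the result list, with no intermediate list of unsplit matches. A minimal sketch; the class name TokenSplitter and the method tokens are made up for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenSplitter {
    private static final Pattern NUMBER_LETTER =
            Pattern.compile("\\b([0-9]+)([a-zA-Z])\\b");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER_LETTER.matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // the digits, e.g. "65"
            out.add(m.group(2)); // the single letter, e.g. "x"
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello world 65x"));
    }
}
```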

JAVA split with regex doesn't work

I have the following String, 46MTS007, and I have to split the numbers from the letters, so as a result I should get an array like {"46", "MTS", "007"}:
String s = "46MTS007";
String[] spl = s.split("\\d+|\\D+");
But spl remains empty. What's wrong with the regex? I tested it in regex101 and it works as expected (with the global flag).

If you want to use split you can use this lookaround based regex:
(?<=\d)(?=\D)|(?<=\D)(?=\d)
This splits at every position where the previous character is a digit and the next one is a non-digit, OR where the previous character is a non-digit and the next one is a digit.
In Java:
String s = "46MTS007";
String[] spl = s.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
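For comparison, here is a small sketch that runs both patterns side by side. The class and method names are illustrative only. The original pattern consumes every character as a delimiter, which leaves only empty strings, and split() drops trailing empty strings, hence the empty array.

```java
import java.util.Arrays;

class SplitDemo {
    // Zero-width split points between a digit and a non-digit (either order).
    static String[] splitRuns(String s) {
        return s.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
    }

    public static void main(String[] args) {
        String s = "46MTS007";
        // Every character belongs to some delimiter match, so nothing remains.
        System.out.println(s.split("\\d+|\\D+").length);
        System.out.println(Arrays.toString(splitRuns(s)));
    }
}
```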

The regex you're using will not split the string usefully. split() treats whatever the regex matches as a delimiter, but this regex matches the content of the whole string rather than a delimiter between tokens, so nothing is left over. You can use Pattern and Matcher to find the different groups in the string instead:
public static void main(String[] args) {
String line = "46MTS007";
String regex = "\\D+|\\d+";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(line);
while(m.find())
System.out.println(m.group());
}
Output:
46
MTS
007
Note: Don't forget to call m.find() again after reading each match; otherwise the matcher will not move on to the next one.

regex pattern - extract a string only if separated by a hyphen

I've looked at other questions, but they didn't lead me to an answer.
I've got this code:
Pattern p = Pattern.compile("exp_(\\d{1}-\\d)-(\\d+)");
The string I want to be matched is: exp_5-22-718
I would like to extract 5-22 and 718. I'm not sure why it's not working. What am I missing? Many thanks.

Try this one:
Pattern p = Pattern.compile("exp_(\\d-\\d+)-(\\d+)");
In your original pattern you specified that the second number should contain exactly one digit, so I used \\d+ to match as many digits as possible.
I also removed {1} from the first number, as it adds nothing to the regexp.

If the string is always prefixed with exp_ I wouldn't use a regular expression.
I would:
replaceFirst() exp_
split() the resulting string on -
Note: This answer is based on that assumption. I offer it as a more robust option if you have multiple hyphens. However, if you need to validate the format of the digits, a regular expression may be better.
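A sketch of this prefix-stripping approach follows. Splitting on every hyphen would yield three pieces (5, 22, 718), so this version instead cuts at the last hyphen to get the 5-22 and 718 the question asks for. That choice, like the names PrefixSplitter and parts, is an assumption made for illustration.

```java
import java.util.Arrays;

class PrefixSplitter {
    static String[] parts(String s) {
        // Drop the fixed prefix, if present.
        String rest = s.replaceFirst("^exp_", "");
        // Cut at the last hyphen so that "5-22" stays in one piece.
        int cut = rest.lastIndexOf('-');
        if (cut < 0) {
            return new String[] { rest };
        }
        return new String[] { rest.substring(0, cut), rest.substring(cut + 1) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parts("exp_5-22-718")));
    }
}
```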

In your regexp, the second \\d is missing a quantifier. It needs + (or {2} for exactly two digits).
String yourString = "exp_5-22-718";
Matcher matcher = Pattern.compile("exp_(\\d-\\d+)-(\\d+)").matcher(yourString);
if (matcher.find()) {
System.out.println(matcher.group(1)); //prints 5-22
System.out.println(matcher.group(2)); //prints 718
}

You can use String.split() to do this. Check the following code.
I assume that your strings start with "exp_".
String str = "exp_5-22-718";
if (str.contains("-")){
String newStr = str.substring(4, str.length());
String[] strings = newStr.split("-");
for (String string : strings) {
System.out.println(string);
}
}

java regular expression

Can anyone please help me do the following in a java regular expression?
I need to read 3 characters from the 5th position from a given String ignoring whatever is found before and after.
Example : testXXXtest
Expected result : XXX

You don't need regex at all.
Just use substring: yourString.substring(4,7)
If you do need to use regex, you can do it like this:
Pattern pattern = Pattern.compile(".{4}(.{3}).*");
Matcher matcher = pattern.matcher("testXXXtest");
matcher.matches();
String whatYouNeed = matcher.group(1);
What it means, step by step:
.{4} - any four characters
( - start capturing group, i.e. what you need
.{3} - any three characters
) - end capturing group, you got it now
.* - followed by 0 or more arbitrary characters
matcher.group(1) - get the 1st (only) capturing group
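The snippet above ignores the boolean returned by matches(); if the input is shorter than seven characters, group(1) then throws an IllegalStateException. A hedged sketch that checks the result first, using a hypothetical FixedSlice class and Optional as one possible way to signal "no match":

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedSlice {
    private static final Pattern SLICE = Pattern.compile(".{4}(.{3}).*");

    static Optional<String> extract(String s) {
        Matcher m = SLICE.matcher(s);
        // Only read the group when the whole input actually matched.
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("testXXXtest").orElse("(no match)"));
        System.out.println(extract("short").orElse("(no match)"));
    }
}
```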

You should be able to use the substring() method to accomplish this:
String example = "testXXXtest";
String result = example.substring(4,7);

This might help: Groups and capturing in java.util.regex.Pattern.
Here is an example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Example {
public static void main(String[] args) {
String text = "This is a testWithSomeDataInBetweentest.";
Pattern p = Pattern.compile("test([A-Za-z0-9]*)test");
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Matched: " + m.group(1));
} else {
System.out.println("No match.");
}
}
}
This prints:
Matched: WithSomeDataInBetween
If you want to match the pattern against the entire input string (rather than seek a substring that matches), you can use matches() instead of find(). With find(), you can continue searching for more matching substrings with subsequent calls.
Also, your question did not specify which characters are admissible or how long the string between the two "test" strings may be. I assumed that any length is OK, including zero, and that we seek a substring composed of lowercase and uppercase letters as well as digits.
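The difference between the two calls can be seen in a short sketch. The class and method names here are only for illustration:

```java
import java.util.regex.Pattern;

class MatchVsFind {
    private static final Pattern P = Pattern.compile("test([A-Za-z0-9]*)test");

    // True if some substring of the text matches the pattern.
    static boolean containsMatch(String text) {
        return P.matcher(text).find();
    }

    // True only if the entire text matches the pattern.
    static boolean wholeMatch(String text) {
        return P.matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "This is a testWithSomeDataInBetweentest.";
        System.out.println(containsMatch(text)); // true
        System.out.println(wholeMatch(text));    // false
    }
}
```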

You can use substring for this, you don't need a regex.
yourString.substring(4,7);
I'm sure you could use a regex too, but why, if you don't need it? Of course, you should protect this code against null and strings that are too short.
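One way to add that protection is sketched below. Returning an empty Optional for unusable input is my assumption; throwing an exception or returning a default would work equally well, and the names are hypothetical.

```java
import java.util.Optional;

class SafeSubstring {
    // Characters 5 through 7 (zero-based indexes 4 to 6), if they exist.
    static Optional<String> middle(String s) {
        if (s == null || s.length() < 7) {
            return Optional.empty();
        }
        return Optional.of(s.substring(4, 7));
    }

    public static void main(String[] args) {
        System.out.println(middle("testXXXtest").orElse("(too short)"));
        System.out.println(middle("abc").orElse("(too short)"));
    }
}
```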

Use the String.replaceAll() Method
If you don't need the code to be performance-optimized, you can try the String.replaceAll() method for a more compact option:
String sDataLine = "testXXXtest";
String sWhatYouNeed = sDataLine.replaceAll( ".{4}(.{3}).*", "$1" );
References
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html#using-regular-expressions-with-string-methods

