As title says, I've a string and I want to extract some data from It.
This is my String:
text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
and I want to extract all the data between the pipes: tab_PRO, 1, 1...and so on
.
I've tried:
Pattern p = Pattern.compile("\\|(.*?)\\|");
Matcher m = p.matcher(text);
while(m.find())
{
for(int i = 1; i< 10; i++) {
test = m.group(i);
System.out.println(test);
}
}
and with this i get the first group that's tab_PRO. But i also get an error
java.lang.IndexOutOfBoundsException: No group 2
Now, probably I didn't understand quite well how the groups works, but I thought that with this I could get the remaining data that I need. I'm not able to understand what I'm missing.
Thanks in advance
Use String.split(). Take into account it expects a regex as an argument, and | is a reserved regex operand, so you'll need to escape it with a \. So, make it two \ so \| won't be interpreted as if you're using an - invalid - escape sequence for the | character:
String[] parts = text.split("\\|");
See it working here:
https://ideone.com/WibjUm
If you want to go with your regex approach, you'll need to group and capture every repetition of characters after every | and restrict them to be anything except |, possibly using a regex like \\|([^\\|]*).
In your loop, you iterate over m.find() and just use capture group 1 because its the only group every match will have.
String text = "|tab_PRO|1|1|#tRecordType#||0|tab_PRO|";
Pattern p = Pattern.compile("\\|([^\\|]*)");
Matcher m = p.matcher(text);
while(m.find()){
System.out.println(m.group(1));
}
https://ideone.com/RNjZRQ
Try using .split() or .substring()
As mentioned in the comments, this is easier done with String.split.
As for your own code, you are unnecessarily using the inner loop, and that's leading to that exception. You only have one group, but the for loop will cause you to query more than one group. Your loop should be as simple as:
Pattern p = Pattern.compile("(?<=\\|)(.*?)\\|");
Matcher m = p.matcher(text);
while (m.find()) {
String test = m.group(1);
System.out.println(test);
}
And that prints
tab_PRO
1
1
#tRecordType#
0
tab_PRO
Note that I had to use a look-behind assertion in your regex.
This question already has answers here:
Matcher not finding overlapping words?
(4 answers)
Closed 4 years ago.
I have a String of the form:
1,2,3,4,5,6,7,8,...
I am trying to find all substrings in this string that contain exactly 4 digits. For this I have the regex [0-9],[0-9],[0-9],[0-9]. Unfortunately when I try to match the regex against my String, I never obtain all the substrings, only a part of all the possible substrings. For instance, in the example above I would only get:
1,2,3,4
5,6,7,8
although I expect to get:
1,2,3,4
2,3,4,5
3,4,5,6
...
How would I go about finding all matches corresponding to my regex?
for info, I am using Pattern and Matcher to find the matches:
Pattern pattern = Pattern.compile([0-9],[0-9],[0-9],[0-9]);
Matcher matcher = pattern.matcher(myString);
List<String> matches = new ArrayList<String>();
while (matcher.find())
{
matches.add(matcher.group());
}
By default, successive calls to Matcher.find() start at the end of the previous match.
To find from a specific location pass a start position parameter to find of one character past the start of the previous find.
In your case probably something like:
while (matcher.find(matcher.start()+1))
This works fine:
Pattern p = Pattern.compile("[0-9],[0-9],[0-9],[0-9]");
public void test(String[] args) throws Exception {
String test = "0,1,2,3,4,5,6,7,8,9";
Matcher m = p.matcher(test);
if(m.find()) {
do {
System.out.println(m.group());
} while(m.find(m.start()+1));
}
}
printing
0,1,2,3
1,2,3,4
...
If you are looking for a pure regex based solution then you may use this lookahead based regex for overlapping matches:
(?=((?:[0-9],){3}[0-9]))
Note that your matches are available in captured group #1
RegEx Demo
Code:
final String regex = "(?=((?:[0-9],){3}[0-9]))";
final String string = "0,1,2,3,4,5,6,7,8,9";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
Code Demo
output:
0,1,2,3
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
Some sample code without regex (since it seems not useful to me). Also I would assume regex to be slower in this case. Yet it will only work as it is as long as the numbers are only 1 character long.
String s = "a,b,c,d,e,f,g,h";
for (int i = 0; i < s.length() - 8; i+=2) {
System.out.println(s.substring(i, i + 7));
}
Ouput for this string:
a,b,c,d
b,c,d,e
c,d,e,f
d,e,f,g
As #OldCurmudgeon pointed out, find() by default start looking from the end of the previous match. To position it right after the first matched element, introduce the first matched region as a capturing group, and use it's end index:
Pattern pattern = Pattern.compile("(\\d,)\\d,\\d,\\d");
Matcher matcher = pattern.matcher("1,2,3,4,5,6,7,8,9");
List<String> matches = new ArrayList<>();
int start = 0;
while (matcher.find(start)) {
start = matcher.end(1);
matches.add(matcher.group());
}
System.out.println(matches);
results in
[1,2,3,4, 2,3,4,5, 3,4,5,6, 4,5,6,7, 5,6,7,8, 6,7,8,9]
This approach would also work if your matching region is longer than one digit
Can you help with this code?
It seems easy, but always fails.
#Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
//Matcher matches = Pattern.compile( "([A-Z0-9])" ).matcher("P-12345678-P");
Matcher matches = Pattern.compile( "([\\w])" ).matcher("P-12345678-P");
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
assertEquals("P12345678P", ret.toString());
}
Constructing a Matcher does not automatically perform any matching. That's in part because Matcher supports two distinct matching behaviors, differing in whether the match is implicitly anchored to the beginning of the Matcher's region. It appears that you could achieve your desired result like so:
#Test
public void normalizeString(){
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]+" ).matcher("P-12345678-P");
while (matches.find()) {
ret.append(matches.group());
}
assertEquals("P12345678P", ret.toString());
}
Note in particular the invocation of Matcher.find(), which was a key omission from your version. Also, the nullary Matcher.group() returns the substring matched by the last find().
Furthermore, although your use of Matcher.groupCount() isn't exactly wrong, it does lead me suspect that you have the wrong idea about what it does. In particular, in your code it will always return 1 -- it inquires about the pattern, not about matches to it.
First of all you don't need to add any group because entire match can be always accessed by group 0, so instead of
(regex) and group(1)
you can use
regex and group(0)
Next thing is that \\w is already character class so you don't need to surround it with another [ ], because it will be similar to [[a-z]] which is same as [a-z].
Now in your
for (int i = 1; i < matches.groupCount(); i++)
ret.append(matches.group(i));
you will iterate over all groups from 1 but you will exclude last group, because they are indexed from 1 so n so i<n will not include n. You would need to use i <= matches.groupCount() instead.
Also it looks like you are confusing something. This loop will not find all matches of regex in input. Such loop is used to iterate over groups in used regex after match for regex was found.
So if regex would be something like (\w(\w))c and your match would be like abc then
for (int i = 1; i < matches.groupCount(); i++)
System.out.println(matches.group(i));
would print
ab
b
because
first group contains two characters (\w(\w)) before c
second group is the one inside first one, right after first character.
But to print them you actually would need to first let regex engine iterate over your input and find() match, or check if entire input matches() regex, otherwise you would get IllegalStateException because regex engine can't know from which match you want to get your groups (there can be many matches of regex in input).
So what you may want to use is something like
StringBuilder ret = new StringBuilder();
Matcher matches = Pattern.compile( "[A-Z0-9]" ).matcher("P-12345678-P");
while (matches.find()){//find next match
ret.append(matches.group(0));
}
assertEquals("P12345678P", ret.toString());
Other way around (and probably simpler solution) would be actually removing all characters you don't want from your input. So you could just use replaceAll and negated character class [^...] like
String input = "P-12345678-P";
String result = input.replaceAll("[^A-Z0-9]+", "");
which will produce new string in which all characters which are not A-Z0-9 will be removed (replaced with "").
I need to replace a repeated pattern within a word with each basic construct unit. For example
I have the string "TATATATA" and I want to replace it with "TA". Also I would probably replace more than 2 repetitions to avoid replacing normal words.
I am trying to do it in Java with replaceAll method.
I think you want this (works for any length of the repeated string):
String result = source.replaceAll("(.+)\\1+", "$1")
Or alternatively, to prioritize shorter matches:
String result = source.replaceAll("(.+?)\\1+", "$1")
It matches first a group of letters, and then it again (using back-reference within the match pattern itself). I tried it and it seems to do the trick.
Example
String source = "HEY HEY duuuuuuude what'''s up? Trololololo yeye .0.0.0";
System.out.println(source.replaceAll("(.+?)\\1+", "$1"));
// HEY dude what's up? Trolo ye .0
You had better use a Pattern here than .replaceAll(). For instance:
private static final Pattern PATTERN
= Pattern.compile("\\b([A-Z]{2,}?)\\1+\\b");
//...
final Matcher m = PATTERN.matcher(input);
ret = m.replaceAll("$1");
edit: example:
public static void main(final String... args)
{
System.out.println("TATATA GHRGHRGHRGHR"
.replaceAll("\\b([A-Za-z]{2,}?)\\1+\\b", "$1"));
}
This prints:
TA GHR
Since you asked for a regex solution:
(\\w)(\\w)(\\1\\2){2,};
(\w)(\w): matches every pair of consecutive word characters ((.)(.) will catch every consecutive pair of characters of any type), storing them in capturing groups 1 and 2. (\\1\\2) matches anytime the characters in those groups are repeated again immediately afterward, and {2,} matches when it repeats two or more times ({2,10} would match when it repeats more than one but less than ten times).
String s = "hello TATATATA world";
Pattern p = Pattern.compile("(\\w)(\\w)(\\1\\2){2,}");
Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group());
//prints "TATATATA"
Would anyone be able to assist me with some regex.
I want to split the following string into a number, string number
"810LN15"
1 method requires 810 to be returned, another requires LN and another should return 15.
The only real solution to this is using regex as the numbers will grow in length
What regex can I used to accomodate this?
String.split won't give you the desired result, which I guess would be "810", "LN", "15", since it would have to look for a token to split at and would strip that token.
Try Pattern and Matcher instead, using this regex: (\d+)|([a-zA-Z]+), which would match any sequence of numbers and letters and get distinct number/text groups (i.e. "AA810LN15QQ12345" would result in the groups "AA", "810", "LN", "15", "QQ" and "12345").
Example:
Pattern p = Pattern.compile("(\\d+)|([a-zA-Z]+)");
Matcher m = p.matcher("810LN15");
List<String> tokens = new LinkedList<String>();
while(m.find())
{
String token = m.group( 1 ); //group 0 is always the entire match
tokens.add(token);
}
//now iterate through 'tokens' and check whether you have a number or text
In Java, as in most regex flavors (Python being a notable exception), the split() regex isn't required to consume any characters when it finds a match. Here I've used lookaheads and lookbehinds to match any position that has a digit one side of it and a non-digit on the other:
String source = "810LN15";
String[] parts = source.split("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)");
System.out.println(Arrays.toString(parts));
output:
[810, LN, 15]
(\\d+)([a-zA-Z]+)(\\d+) should do the trick. The first capture group will be the first number, the second capture group will be the letters in between and the third capture group will be the second number. The double backslashes are for java.
This gives you the exact thing you guys are looking for
Pattern p = Pattern.compile("(([a-zA-Z]+)|(\\d+))|((\\d+)|([a-zA-Z]+))");
Matcher m = p.matcher("810LN15");
List<Object> tokens = new LinkedList<Object>();
while(m.find())
{
String token = m.group( 1 );
tokens.add(token);
}
System.out.println(tokens);