What is wrong with this Java regex?

final static private Pattern includePattern = Pattern.compile("^\\s+([^\\s]*)");
...
Matcher mtest = includePattern.matcher(" this.txt");
String ftest = mtest.group(1);
I get an exception No match found at java.util.regex.Matcher.group(Matcher.java:468)
I'm looking for at least 1 space character followed by a captured group of nonspace characters. Where have I gone wrong?

You'll first need to call .find() before you can use group(...).
Note that find() returns a boolean, so it's safe(r) to do something like this:
final static private Pattern includePattern = Pattern.compile("^\\s+([^\\s]*)");
Matcher mtest = includePattern.matcher(" this.txt");
String ftest = mtest.find() ? mtest.group(1) : null;
And [^\\s] could be rewritten as \\S (capital s).
You might have simplified your example a bit in your question, but I assume you're aware of the fact that String.trim() takes care of any leading and trailing white space characters.
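As a minimal sketch of the fix described above (the class and method names here are made up for illustration), the matcher is asked to find() first and group(1) is read only when that succeeds:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IncludeMatch {
    // Leading whitespace, then capture the run of non-whitespace characters.
    private static final Pattern INCLUDE_PATTERN = Pattern.compile("^\\s+(\\S*)");

    // Hypothetical helper: returns the captured name, or null when nothing matches.
    static String extract(String line) {
        Matcher m = INCLUDE_PATTERN.matcher(line);
        // group(1) is only valid after a successful find(), matches() or lookingAt().
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(" this.txt")); // this.txt
        System.out.println(extract("this.txt"));  // null (no leading whitespace)
    }
}
```

Returning null for "no match" mirrors the answer above; an Optional<String> would work just as well if you prefer to avoid nulls.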

Related

How to find the first word of a sentence for only three different options

I can have these three kinds of Strings
ALPHA_whatever_1234567
BETA_whateverDifferent_7654321
GAMMA_anotherOption_1237654
I want to extract the beginning of each String, whether it is ALPHA, BETA or GAMMA.
So, for example, I would like to get:
ALPHA_whatever_1234567 -> ALPHA
BETA_whateverDifferent_7654321 -> BETA
GAMMA_anotherOption_1237654 -> GAMMA
I want to use Regular Expression, and I tried something like this
private static final Pattern PATTERN = Pattern.compile("(.*)_.*");
But it doesn't work for some Strings. I recover the beginning by
Matcher m = PATTERN.matcher(string);
m.find(1);
I also tried this Pattern:
private static final Pattern PATTERN = Pattern.compile("([ALPHA]|[BETA]|[GAMMA])_.*");
But it returns only the first character of the String.
What am I doing wrong?
Just remove the brackets around the ALPHA, BETA and GAMMA since they represent character classes, i.e. [ALPHA] will match either of the letters A, L, P, H or A.
private static final Pattern PATTERN = Pattern.compile("(ALPHA|BETA|GAMMA)_.*");
Your regex does not work because dot . consumes too much, eating up the underscore. Here is how you can fix it:
private static final Pattern PATTERN = Pattern.compile("([^_]*)_.*");
Another alternative would be to use a "reluctant" quantifier for the asterisk, (.*?), but in more complex patterns that can lead to heavy backtracking.
Your other solution uses character classes [] incorrectly. The correct expression would have no square brackets, like this:
private static final Pattern PATTERN = Pattern.compile("(ALPHA|BETA|GAMMA)_.*");
[...] in regex is a character class. A character class can only match a single character.
So [ALPHA] really means "match one of these characters: A, L, P, H, A"
If you remove the brackets, then it will match the entire word:
(ALPHA|BETA|GAMMA)_.*
If you are not insistent on using regular expressions, you could give this a try:
String firstWord = myString.split("_")[0];
Where myString contains your String.
String strr = "ALPHA_whatever_1234567";
String[] result = strr.split("_");
return result[0];
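To see the difference between the patterns discussed in the answers above, here is a small sketch that runs all three against the same input. The class name, constant names and the firstGroup helper are assumptions made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixDemo {
    // Greedy dot: group 1 runs up to the LAST underscore.
    static final Pattern GREEDY = Pattern.compile("(.*)_.*");
    // Negated class: group 1 stops at the FIRST underscore.
    static final Pattern NO_UNDERSCORE = Pattern.compile("([^_]*)_.*");
    // Alternation: group 1 must be one of the three known prefixes.
    static final Pattern KEYWORDS = Pattern.compile("(ALPHA|BETA|GAMMA)_.*");

    // Hypothetical helper: group 1 if the whole string matches, otherwise null.
    static String firstGroup(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "ALPHA_whatever_1234567";
        System.out.println(firstGroup(GREEDY, s));        // ALPHA_whatever
        System.out.println(firstGroup(NO_UNDERSCORE, s)); // ALPHA
        System.out.println(firstGroup(KEYWORDS, s));      // ALPHA
        System.out.println(firstGroup(KEYWORDS, "DELTA_x_1")); // null
    }
}
```

The alternation version also rejects unknown prefixes, which the other two do not.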

Regex to replace a repeating string pattern

I need to replace a repeated pattern within a word with its basic repeating unit. For example
I have the string "TATATATA" and I want to replace it with "TA". Also I would probably replace more than 2 repetitions to avoid replacing normal words.
I am trying to do it in Java with replaceAll method.
I think you want this (works for any length of the repeated string):
String result = source.replaceAll("(.+)\\1+", "$1");
Or alternatively, to prioritize shorter matches:
String result = source.replaceAll("(.+?)\\1+", "$1");
It first matches a group of characters and then the same text again (using a back-reference within the pattern itself). I tried it and it seems to do the trick.
Example
String source = "HEY HEY duuuuuuude what'''s up? Trololololo yeye .0.0.0";
System.out.println(source.replaceAll("(.+?)\\1+", "$1"));
// HEY dude what's up? Trolo ye .0
You are better off using a precompiled Pattern here rather than String.replaceAll(). For instance:
private static final Pattern PATTERN = Pattern.compile("\\b([A-Z]{2,}?)\\1+\\b");
//...
final Matcher m = PATTERN.matcher(input);
final String ret = m.replaceAll("$1");
edit: example:
public static void main(final String... args)
{
System.out.println("TATATA GHRGHRGHRGHR"
.replaceAll("\\b([A-Za-z]{2,}?)\\1+\\b", "$1"));
}
This prints:
TA GHR
Since you asked for a regex solution:
(\w)(\w)(\1\2){2,}
(\w)(\w) matches any pair of consecutive word characters ((.)(.) would match any pair of characters of any type) and stores them in capturing groups 1 and 2. (\1\2) matches when that pair is repeated immediately afterward, and {2,} requires two or more such repetitions ({2,10} would require at least two and at most ten). In a Java string literal every backslash has to be doubled:
String s = "hello TATATATA world";
Pattern p = Pattern.compile("(\\w)(\\w)(\\1\\2){2,}");
Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group());
//prints "TATATATA"
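The question mentions collapsing only runs with more than two repetitions so that ordinary words are left alone. One way to read that, sketched here as an assumption rather than as any of the answerers' code, is to require the unit to occur at least three times in a row by using \\1{2,} instead of \\1+. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class CollapseRepeats {
    // A unit of 2+ letters (shortest first) followed by at least two more copies of it.
    private static final Pattern REPEATED = Pattern.compile("([A-Za-z]{2,}?)\\1{2,}");

    // Hypothetical helper: replaces each run of 3+ copies with a single copy.
    static String collapse(String s) {
        return REPEATED.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("hello TATATATA world, TATA stays"));
        // hello TA world, TATA stays
    }
}
```

With this threshold "TATA" and words such as "banana" are left untouched, because their unit occurs only twice.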

Regex to get the string after # sign

I have a string like follows:
#78517700-1f01-11e3-a6b7-3c970e02b4ec, #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec ....
I want to extract the string after #.
I have the current code like follows:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#[^\\s]+");
Matcher m = PATTERN_LOGIN.matcher("#78517700-1f01-11e3-a6b7-3c970e02b4ec , #68517700-1f01-11e3-a6b7-3c970e02b4ec, #98517700-1f01-11e3-a6b7-3c970e02b4ec, #38517700-1f01-11e3-a6b7-3c970e02b4ec");
while (m.find()) {
String mentionedLogin = m.group();
.......
}
... but m.group() gives me #78517700-1f01-11e3-a6b7-3c970e02b4ec but I wanted 78517700-1f01-11e3-a6b7-3c970e02b4ec
You should use the regex "#([^\\s]+)" and then m.group(1), which returns what was captured by the capturing parentheses ().
m.group() or m.group(0) returns the full string matched by your regex.
I would modify the pattern so that the hash sign stays outside the capturing group:
private final static Pattern PATTERN_LOGIN = Pattern.compile("#([^\\s]+)");
So the first group will be the GUID only
Correct answers are given in the other responses. I will add some clarification. Your code is working correctly, as expected.
Your regex means: match a string which starts with # and is followed by one or more characters that are not white space. So if you omit the parentheses you get the full string, as expected.
The parentheses, as mentioned in the other responses, mark capturing groups. In layman's terms: the regex engine matches the whole pattern once and, along the way, records which part of the input each parenthesized group matched, numbering the groups by their opening parenthesis.
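Putting the answers together, a sketch of the extraction loop could look like the following. The class and method names are hypothetical. One assumption goes beyond the answers: the character class also excludes commas, because in the sample input the IDs are followed directly by ",", which [^\\s]+ alone would capture as part of the ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HashIds {
    // '#' stays outside the group; the group stops at whitespace or a comma (assumption).
    private static final Pattern ID = Pattern.compile("#([^\\s,]+)");

    // Hypothetical helper: collects the text after each '#'.
    static List<String> extract(String input) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(input);
        while (m.find()) {
            ids.add(m.group(1)); // group(1) is the capture, group() would include '#'
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extract("#78517700-1f01 , #68517700-1f01, #98517700-1f01"));
        // [78517700-1f01, 68517700-1f01, 98517700-1f01]
    }
}
```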

Java Regexp capturing group includes space, why?

I am trying to parse this string,
"斬釘截鐵 斩钉截铁 [zhan3 ding1 jie2 tie3] /to chop the nail and slice the iron (idiom)/resolute and decisive/unhesitating/definitely/without any doubt/";
With this code
private static final Pattern TRADITIONAL = Pattern.compile("(.*?) ");
private String extractSinglePattern(String row, Pattern pattern) {
Matcher matcher = pattern.matcher(row);
if (matcher.find()) {
return matcher.group();
}
return null;
}
However, for some reason the string returned contains a space at the end
org.junit.ComparisonFailure: expected:<斬釘截鐵[]> but was:<斬釘截鐵[ ]>
Is there something wrong with my pattern?
I have also tried
private static final Pattern TRADITIONAL = Pattern.compile("(.*?)\\s");
but to no avail
I have also tried matching with two spaces at the end of the pattern, but it doesn't match (there is only one space).
You're using Matcher.group() which is documented as:
Returns the input subsequence matched by the previous match.
The match includes the space. The capturing group within the match doesn't, but you haven't asked for that.
If you change your return statement to:
return matcher.group(1);
then I believe it'll do what you want.
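Alternatively, use this regular expression: (.+?)(?=\s+). The lookahead (?=\s+) checks for the following whitespace without consuming it, so matcher.group() no longer includes the space. In a Java string literal the backslash must be doubled: "(.+?)(?=\\s+)".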
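The two suggestions above can be compared side by side. This is only a sketch: the class, constants and helpers are invented for the example, and an ASCII stand-in string is used instead of the dictionary row from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstToken {
    // The question's pattern: the trailing space is part of the overall match.
    static final Pattern CAPTURE = Pattern.compile("(.*?) ");
    // Lookahead variant: whitespace is required but not consumed.
    static final Pattern LOOKAHEAD = Pattern.compile("(.+?)(?=\\s+)");

    // Hypothetical helper: the whole match, i.e. group(0).
    static String wholeMatch(Pattern p, String row) {
        Matcher m = p.matcher(row);
        return m.find() ? m.group() : null;
    }

    // Hypothetical helper: only the first capturing group.
    static String firstGroup(Pattern p, String row) {
        Matcher m = p.matcher(row);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "zhan3 ding1 jie2 tie3";
        System.out.println("[" + wholeMatch(CAPTURE, row) + "]");   // [zhan3 ]
        System.out.println("[" + firstGroup(CAPTURE, row) + "]");   // [zhan3]
        System.out.println("[" + wholeMatch(LOOKAHEAD, row) + "]"); // [zhan3]
    }
}
```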

Java regex grouping

I have the following entry in a properties file:
some.key = \n
[1:Some value] \n
[14:Some other value] \n
[834:Yet another value] \n
I am trying to parse it using a regular expression, but I can't seem to get the grouping correct. I am trying to print out a key/value for each entry. Example: Key="834", Value="Yet another value"
private static final String REGEX_PATTERN = "[(\\d+)\\:(\\w+(\\s)*)]+";
private void foo(String propValue){
final Pattern p = Pattern.compile(REGEX_PATTERN);
final Matcher m = p.matcher(propValue);
while (m.find()) {
final String key = m.group(0).trim();
final String value = m.group(1).trim();
System.out.println(String.format("Key[%s] Value[%s]", key, value));
}
}
The error I get is:
Exception: java.lang.IndexOutOfBoundsException: No group 1
I thought I was grouping correctly in the regex but I guess not. Any help would be appreciated!
Thanks
UPDATE:
Escaping the brackets worked. Changed the pattern to the following. Thanks for the feedback!
private static final String REGEX_PATTERN = "\\[(\\d+)\\:(\\w+(\\w|\\s)*)\\]+";
[ should be escaped (as well as ]).
"\\[(\\d+)....\\]+"
[] Is used for character classes: [0-9] == (0|1|2|...|9)
Try this:
private static final String REGEX_PATTERN = "\\[(\\d+):([\\w\\s]+)\\]";
final Pattern p = Pattern.compile(REGEX_PATTERN);
final Matcher m = p.matcher(propValue);
while (m.find()) {
final String key = m.group(1).trim();
final String value = m.group(2).trim();
System.out.println(String.format("Key[%s] Value[%s]", key, value));
}
the [ and ] need to be escaped because they represent the start and end of a character class
group(0) is always the full match, so your groups should start with 1
note how I wrote the second group [\\w\\s]+. This means a character class of word or whitespace characters
It's your regex, [] are special characters and need to be escaped if you want to interpret them literally.
Try
"\\[(\\d+)\\:(\\w+(\\s)*)\\]"
Note - I removed the '+'. The matcher will keep finding substrings that match the pattern, so the + is not necessary. (Java has no GLOBAL switch; calling find() in a loop is what moves on to the next match.)
I can't help but feel this might be simpler without regex though, perhaps by splitting on \n or [ and then splitting on : for each of those.
Since your string consists of several lines, you may be tempted to tell Pattern about it:
final Pattern p = Pattern.compile(REGEX_PATTERN, Pattern.MULTILINE);
Note, however, that MULTILINE only changes where ^ and $ match, and DOTALL only changes whether . matches line terminators. The patterns above use none of these, so neither flag changes the result here, although adding them is harmless:
final Pattern p = Pattern.compile(REGEX_PATTERN, Pattern.MULTILINE | Pattern.DOTALL);
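For reference, here is a self-contained sketch of the escaped-bracket pattern from the answers above, applied to a value shaped like the one in the question. The class name, the parse method and the choice of a LinkedHashMap are assumptions for illustration. The pattern is compiled without flags because it contains no ^, $ or dot.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntryParser {
    // Literal '[', digits (key), ':', word/whitespace characters (value), literal ']'.
    private static final Pattern ENTRY = Pattern.compile("\\[(\\d+):([\\w\\s]+)\\]");

    // Hypothetical helper: returns key/value pairs in the order they appear.
    static Map<String, String> parse(String propValue) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(propValue);
        while (m.find()) {
            // group(0) is the whole "[key:value]" match; the captures start at 1.
            result.put(m.group(1).trim(), m.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String value = "[1:Some value] \n[14:Some other value] \n[834:Yet another value] \n";
        System.out.println(parse(value));
        // {1=Some value, 14=Some other value, 834=Yet another value}
    }
}
```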
