Using Regex to ignore a pattern in java - java

I have a sentence: "we:PR show:V".
I want to match only the characters after ":" and before the next whitespace ("\\s") using a regex Pattern and Matcher.
I used the following pattern:
Pattern pattern=Pattern.compile("^(?!.*[\\w\\d\\:]).*$");
But it did not work.
What is the best pattern to get the output?

For a situation such as this, if you are using Java, it may be easier to do something with substrings:
String input = "we:PR show:V";
String colon = ":";
String space = " ";
List<String> results = new ArrayList<String>();
int colonLocation = input.indexOf(colon);
while (colonLocation != -1) {
    // the token ends at the first space after the colon, or at the end of the string
    int spaceLocation = input.indexOf(space, colonLocation);
    spaceLocation = (spaceLocation == -1 ? input.length() : spaceLocation);
    results.add(input.substring(colonLocation + 1, spaceLocation));
    // continue with the next colon after the current token
    colonLocation = input.indexOf(colon, spaceLocation);
}
return results; // "PR", "V"
This avoids the regex engine altogether, which is likely to be faster for input this simple.

The following regex assumes that any non-whitespace characters following a colon (in turn preceded by non-colon characters) are a valid match:
[^:]+:(\S+)(?:\s+|$)
Use like:
String input = "we:PR show:V";
Pattern pattern = Pattern.compile("[^:]+:(\\S+)(?:\\s+|$)");
Matcher matcher = pattern.matcher(input);
int start = 0;
while (matcher.find(start)) {
String match = matcher.group(1); // = "PR" then "V"
// Do stuff with match
start = matcher.end( );
}
The pattern matches, in order:
At least one character that isn't a colon.
A colon.
At least one non-whitespace character (our match).
At least one whitespace character, or the end of input.
The loop continues as long as the regex matches an item in the string, beginning at the index start, which is always adjusted to point to after the end of the current match.
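As a compact variant of the loop above, here is a minimal sketch that collects the tokens into a list. It assumes a simpler pattern, `:(\S+)`, which does not require non-colon characters before the colon, and it relies on plain find(), which already resumes after the previous match. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonTokens {
    // Hypothetical helper: returns the run of non-whitespace characters after each colon.
    static List<String> afterColons(String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(":(\\S+)").matcher(input);
        while (matcher.find()) {          // each call continues after the previous match
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(afterColons("we:PR show:V")); // [PR, V]
    }
}
```

Because `\S+` also matches colons, an input such as "a:b:c" would yield "b:c" with this simplified pattern.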

Related

How to replace multiple consecutive occurrences of a character with a maximum allowed number of occurrences?

CharSequence content = new StringBuffer("aaabbbccaaa");
String pattern = "([a-zA-Z])\\1\\1+";
String replace = "-";
Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = patt.matcher(content);
boolean isMatch = matcher.find();
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < content.length(); i++) {
while (matcher.find()) {
matcher.appendReplacement(buffer, replace);
}
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
In the above code, content is the input string.
I am trying to find repeated runs of a character in the string and cut each run down to a maximum allowed number of occurrences.
For example:
input - ("abaaadccc", 2)
output - "abaadcc"
Here aaa and ccc are replaced by aa and cc, because the maximum allowed repetition is 2.
In the above code I find such runs and replace them with -, and that works. But how can I get the current character and replace the run with the allowed number of occurrences?
I.e., if aaa is found, it is replaced by aa.
Or is there an alternative method without using regex?
You can nest a second group inside the first one and use the first group as the replacement:
String result = "aaabbbccaaa".replaceAll("(([a-zA-Z])\\2)\\2+", "$1");
Here's how it works:
( first group - a character repeated two times
([a-zA-Z]) second group - a character
\2 a character repeated once
)
\2+ a character repeated at least once more
Thus, the first group captures a replacement string.
It isn't hard to extrapolate this solution for a different maximum value of allowed repeats:
String input = "aaaaabbcccccaaa";
int maxRepeats = 4;
String pattern = String.format("(([a-zA-Z])\\2{%s})\\2+", maxRepeats-1);
String result = input.replaceAll(pattern, "$1");
System.out.println(result); //aaaabbccccaaa
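The same idea can be wrapped in a small method. This is only a sketch: the class name, the method name limitRepeats, and the argument check are assumptions added for illustration, and like the pattern above it only looks at ASCII letters.

```java
class RepeatLimiter {
    // Hypothetical helper: trims every run of the same letter to at most maxRepeats characters.
    static String limitRepeats(String input, int maxRepeats) {
        if (maxRepeats < 1) {
            throw new IllegalArgumentException("maxRepeats must be at least 1");
        }
        // Group 1 = the letter plus (maxRepeats - 1) repeats; \2+ = the surplus to drop.
        String pattern = String.format("(([a-zA-Z])\\2{%d})\\2+", maxRepeats - 1);
        return input.replaceAll(pattern, "$1");
    }

    public static void main(String[] args) {
        System.out.println(limitRepeats("abaaadccc", 2));       // abaadcc
        System.out.println(limitRepeats("aaaaabbcccccaaa", 4)); // aaaabbccccaaa
    }
}
```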
Since you defined a group in your regex, you can get the matching characters of this group by calling matcher.group(1). In your case it contains the first character from the repeating group so by appending it twice you get your expected result.
CharSequence content = new StringBuffer("aaabbbccaaa");
String pattern = "([a-zA-Z])\\1\\1+";
Pattern patt = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher matcher = patt.matcher(content);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
System.out.println("found : "+matcher.start()+","+matcher.end()+":"+matcher.group(1));
matcher.appendReplacement(buffer, matcher.group(1)+matcher.group(1));
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
Output:
found : 0,3:a
found : 3,6:b
found : 8,11:a
aabbccaa
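As for doing it without regex, one possible approach is a single pass that counts the length of the current run and only copies a character while the run is within the limit. The sketch below is an assumption about what such an alternative could look like, not code from the question. Unlike the regex versions, it treats every character the same way (digits and punctuation included) and compares characters case-sensitively.

```java
class RunLengthLimiter {
    // Hypothetical helper: keeps at most maxRepeats consecutive copies of any character.
    static String limitRepeats(String input, int maxRepeats) {
        StringBuilder out = new StringBuilder();
        int run = 0; // length of the current run of identical characters
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            run = (i > 0 && c == input.charAt(i - 1)) ? run + 1 : 1;
            if (run <= maxRepeats) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(limitRepeats("abaaadccc", 2));   // abaadcc
        System.out.println(limitRepeats("aaabbbccaaa", 2)); // aabbccaa
    }
}
```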

How to Regex replace characters at the end of a string

I have an issue in Java when trying to remove the characters from the end of a string. This has now become a generic pattern match issue that I cannot resolve.
PROBLEM = remove all pluses, minuses and spaces (not bothered about whitespace) from the end of a string.
Pattern myRegex;
Matcher myMatch;
String myPattern = "";
String myString = "";
String myResult = "";
myString="surname; forename--+ + --++ ";
myPattern="^(.*)[-+ ]*$";
//expected result = "surname; forename"
myRegex = Pattern.compile(myPattern);
myMatch = myRegex.matcher(myString);
if (myMatch.find( )) {
myResult = myMatch.group(1);
} else {
myResult = myString;
}
The only way I can get this to work is by reversing the string and reversing the pattern match, then I reverse the result to get the right answer!
In the following pattern:
^(.*)[-+ ]*$
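... the .* is a greedy match. This means that it will match as many characters as possible while still allowing the entire pattern to match. Here it swallows the trailing characters as well, because [-+ ]* is also satisfied by matching nothing.
You need to change it to non-greedy by adding ?, so that the group captures as little as possible: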
^(.*?)[-+ ]*$
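To see the difference, the sketch below runs both patterns against the string from the question. It also shows an alternative that is not part of the answer above: removing the trailing run directly with replaceAll. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingSigns {
    // Returns group 1 of the first match, or the input itself if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : input;
    }

    // Alternative sketch: delete the trailing run of '-', '+' and spaces directly.
    static String trimTrailingSigns(String input) {
        return input.replaceAll("[-+ ]+$", "");
    }

    public static void main(String[] args) {
        String s = "surname; forename--+ + --++ ";
        System.out.println("[" + firstGroup("^(.*)[-+ ]*$", s) + "]");  // [surname; forename--+ + --++ ]
        System.out.println("[" + firstGroup("^(.*?)[-+ ]*$", s) + "]"); // [surname; forename]
        System.out.println("[" + trimTrailingSigns(s) + "]");           // [surname; forename]
    }
}
```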

Java regex skipping matches

I have some text; I want to extract pairs of words that are not separated by punctuation. This is the code:
//n-grams
Pattern p = Pattern.compile("[a-z]+");
if (n == 2) {
p = Pattern.compile("[a-z]+ [a-z]+");
}
if (n == 3) {
p = Pattern.compile("[a-z]+ [a-z]+ [a-z]+");
}
Matcher m = p.matcher(text.toLowerCase());
ArrayList<String> result = new ArrayList<String>();
while (m.find()) {
String temporary = m.group();
System.out.println(temporary);
result.add(temporary);
}
The problem is that it skips some matches. For example, "My name is James" with n = 3 must match "my name is" and "name is james", but instead it matches just the first. Is there a way to solve this?
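You can capture overlapping matches by putting a capturing group inside a lookahead:
(?=(\b[a-z]+\b \b[a-z]+\b \b[a-z]+\b))
The lookahead consumes no characters, so the matcher can find a match at every word start, and each match exposes its text as group 1. In your case, successive calls to find() give:
Match 1, group(1) -> my name is
Match 2, group(1) -> name is james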
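Put together with the loop from the question, the lookahead approach might look like the sketch below. The pattern is built for an arbitrary n, which is an extension not shown in the answer; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NGrams {
    // Hypothetical helper: returns all overlapping n-grams of space-separated lowercase words.
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        // One word, then (n - 1) times " word", all captured inside a zero-width lookahead.
        String regex = "(?=(\\b[a-z]+(?: [a-z]+){" + (n - 1) + "}\\b))";
        Matcher m = Pattern.compile(regex).matcher(text.toLowerCase());
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group(1)); // the match itself is empty; the text is in group 1
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("My name is James", 3)); // [my name is, name is james]
    }
}
```

The leading `\b` keeps the matcher from starting in the middle of a word, and the trailing `\b` keeps the last word from being cut short.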
A regular expression is applied to the string from left to right, and once a source character has been used in a match, it cannot be reused.
For example, the regex "121" will match "31212142121" only twice, as "121___121".
I tend to use the argument to the find() method of Matcher:
Matcher m = p.matcher(text);
int position = 0;
while (m.find(position)) {
String temporary = m.group();
position = m.start();
System.out.println(position + ":" + temporary);
position++;
}
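So after each iteration, the search restarts one character after the start of the previous match. Note that a search restarted this way can begin in the middle of a word (for "my name is james" and n = 3, the second match is "y name is"), so start the pattern with \b if only whole words should match.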
Hope that helped!

JAVA RegEx on _ delimited string

OK, I need a regex that captures the first word up to the underscore character, then the second word, and then the remaining words, which may themselves contain underscores. Here are some inputs and the expected results:
gear_Armor_Blessed_Robes = "gear", "Armor" and "Blessed_Robes"
gear_Armor_Chain_Coif = "gear", "Armor" and "Chain_Coif"
gear_Armor_Chain_Hauberk = "gear", "Armor" and "Chain_Hauberk"
gear_Armor_Chain_Shirt = "gear", "Armor" and "Chain_Shirt"
gear_Armor_Chain_Leggings = "gear", "Armor" and "Chain_Leggings"
There's no need to use a regex for this, just use indexOf and substring:
String s = "Armor_Blessed_Robes";
int idx = s.indexOf("_");
System.out.println(s.substring(0, idx)); // Armor
System.out.println(s.substring(idx + 1)); // Blessed_Robes
With regex, you'd have to use the following, which is a tad more complicated and harder to read:
Pattern p = Pattern.compile("([^_]+)_(.+)");
Matcher m = p.matcher(s);
if (m.find()) {
String first = m.group(1); // Armor
String second = m.group(2); // Blessed_Robes
}
You can split along _, limiting the number of splits to 3:
assert Arrays.equals("gear_Armor_Blessed_Robes".split("_", 3),
new String[] { "gear", "Armor", "Blessed_Robes" });
It should give you a String[] that contains the 3 Strings as specified in your question.
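For comparison, here is a small sketch that produces the three parts both ways: with split and a limit of 3, and with a three-group regex. The regex `([^_]+)_([^_]+)_(.+)` is an assumed extension of the two-group pattern shown earlier, and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GearNames {
    // split with limit 3: at most three parts, the last one keeps its underscores.
    static String[] splitThree(String name) {
        return name.split("_", 3);
    }

    // Regex variant: two underscore-free words, then the remainder.
    static String[] matchThree(String name) {
        Matcher m = Pattern.compile("([^_]+)_([^_]+)_(.+)").matcher(name);
        if (!m.matches()) {
            return new String[0]; // fewer than three parts
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitThree("gear_Armor_Blessed_Robes"))); // [gear, Armor, Blessed_Robes]
        System.out.println(Arrays.toString(matchThree("gear_Armor_Chain_Coif")));    // [gear, Armor, Chain_Coif]
    }
}
```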

Why isn't this lookahead assertion working in Java?

I come from a Perl background and am used to doing something like the following to match leading digits in a string and perform an in-place increment by one:
my $string = '0_Beginning';
$string =~ s|^(\d+)(?=_.*)|$1+1|e;
print $string; # '1_Beginning'
With my limited knowledge of Java, things aren't so succinct:
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
String digit = string.replaceFirst( p.toString(), "$1" ); // To get the digit
Integer oneMore = Integer.parseInt( digit ) + 1; // Evaluate ++digit
string.replaceFirst( p.toString(), oneMore.toString() ); //
The regex doesn't match here... but it did in Perl.
What am I doing wrong here?
Actually it matches. You can find out by printing
System.out.println(p.matcher(string).find());
The issue is with line
String digit = string.replaceFirst( p.toString(), "$1" );
which actually does nothing, because it replaces the match (which is just the first group, since the lookahead is not part of the match) with the content of the first group.
You can get the desired result (namely the digit) via the following code
Matcher m = p.matcher(string);
String digit = m.find() ? m.group(1) : "";
Note: you should check the result of m.find() anyway, in case nothing matches. With the line above, digit would then be an empty string, and parseInt throws a NumberFormatException for it. Thus the full code looks something like
Pattern p = Pattern.compile("^(\\d+)(?=_.*)");
String string = "0_Beginning";
Matcher m = p.matcher(string);
if (m.find()) {
String digit = m.group(1);
Integer oneMore = Integer.parseInt(digit) + 1;
string = m.replaceAll(oneMore.toString());
System.out.println(string);
} else {
System.out.println("No match");
}
Let's see what you are doing here.
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
You declare and initialize String and pattern objects.
String digit = string.replaceFirst( p.toString(), "$1" ); // To get the digit
(You are converting the pattern back into a string, and replaceFirst creates a new Pattern from this. Is this intentional?)
As Howard says, this replaces the first match of the pattern in the string with the contents of the first group, and the match of the pattern is just 0 here, as the first group. Thus digit is equal to string, ...
Integer oneMore = Integer.parseInt( digit ) + 1; // Evaluate ++digit
... and your parsing fails here.
string.replaceFirst( p.toString(), oneMore.toString() ); //
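This would work (although it again converts the pattern to a string and back to a pattern), but the result is thrown away: strings are immutable, so replaceFirst returns a new string that you have to assign.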
Here is how I would do this:
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
Matcher matcher = p.matcher(string);
StringBuffer result = new StringBuffer();
while (matcher.find()) {
    int number = Integer.parseInt(matcher.group());
    // appends the text before the match, then the incremented number
    matcher.appendReplacement(result, String.valueOf(number + 1));
}
matcher.appendTail(result);
return result.toString(); // 1_Beginning
(Of course, for your regex the loop will only execute once, since the regex is anchored.)
Edit: To clarify my statement about string.replaceFirst:
This method does not return a pattern, but uses one internally. From the documentation:
Replaces the first substring of this string that matches the given regular expression with the given replacement.
An invocation of this method of the form str.replaceFirst(regex, repl) yields exactly the same result as the expression
Pattern.compile(regex).matcher(str).replaceFirst(repl)
Here we see that a new pattern is compiled from the first argument.
This also shows us another way to do what you wanted to do:
String string = "0_Beginning";
Pattern p = Pattern.compile( "^(\\d+)(?=_.*)" );
Matcher m = p.matcher(string);
if (m.find()) {
    String digit = m.group();
    int oneMore = Integer.parseInt( digit ) + 1;
    // Matcher.replaceFirst takes only the replacement string
    return m.replaceFirst(String.valueOf(oneMore));
}
This only compiles the pattern once, instead of thrice like in your original program - but still does the matching twice (once for find, once for replaceFirst), instead of once like in my program.
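If Java 9 or later is available, Matcher.replaceFirst also accepts a function that computes the replacement from the match, which is probably the closest equivalent to Perl's /e modifier. The sketch below assumes that version; the class name, the method name, and the slightly shortened lookahead `(?=_)` are illustrative choices, not taken from the answers above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingNumber {
    // Compiled once; matches leading digits that are followed by an underscore.
    private static final Pattern LEADING_DIGITS = Pattern.compile("^(\\d+)(?=_)");

    // Hypothetical helper: increments the leading number, or returns the input unchanged.
    static String increment(String input) {
        Matcher m = LEADING_DIGITS.matcher(input);
        // Java 9+: the function receives the MatchResult and returns the replacement text.
        return m.replaceFirst(match -> String.valueOf(Integer.parseInt(match.group(1)) + 1));
    }

    public static void main(String[] args) {
        System.out.println(increment("0_Beginning")); // 1_Beginning
    }
}
```

This matches only once. Leading zeros are lost ("007_x" becomes "8_x"), and Integer.parseInt throws a NumberFormatException for numbers that do not fit in an int.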
