Getting the "context" text of a matched group - java

I'm using Java's Matcher class to extract some strings. For each match I also get its start index and end index. What I want now is the x characters preceding and following each match.
So I just call the substring method on the string, from {start index minus x} to {end index plus x}, but that seems a little heavy: for every match I have to go back to the string to get its context.
I wanted to know whether there's a better way to do that.
Here is what I've done so far:
The part that bothers me is text.substring: how expensive is it?
String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("\\d{2}").matcher(text);
int x = 5;
while (matcher.find()) {
String match = matcher.group();
int start = matcher.start();
int end = matcher.end();
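String pretext = text.substring(Math.max(0, start - x), start); // clamp so a match near the start cannot throw
String postext = text.substring(end, Math.min(text.length(), end + x)); // clamp at the end of the text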
System.out.println(pretext + " - " + match + " - " + postext);
}
A suggested answer uses grouping to solve this:
using the regex (.{5})(\d{2})(.{5}).
First of all, this can't capture matches that have fewer than 5 characters before them. The fix for that is (.{0,5})(\d{2})(.{0,5}), which works nicely for a simple regex like (\d{2}), but for one like "c?at" and the given text "cat" the groups would be:
c
at

String text = "Some 22 text with 44 characters";
Matcher matcher = Pattern.compile("(.{5})(\\d{2})(.{5})").matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1) + " - " + matcher.group(2) + " - " + matcher.group(3));
}
output :
Some - 22 - text
with - 44 - char
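The group-stealing problem described above for a pattern like c?at can be reproduced with a small sketch. The class and method names (ContextDemo, viaGroups, viaSubstring) are made up for illustration. The second method follows the start()/end() approach from the question with clamped bounds; it should not suffer from the problem, because the context is cut out only after the match is known.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original post.
class ContextDemo {

    // Wraps the pattern in optional context groups, as in the suggested answer.
    static String viaGroups(String regex, String text, int x) {
        Matcher m = Pattern.compile("(.{0," + x + "})(" + regex + ")(.{0," + x + "})").matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3);
    }

    // Finds the match first, then cuts the context out with clamped indices.
    static String viaSubstring(String regex, String text, int x) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return null;
        }
        String pre = text.substring(Math.max(0, m.start() - x), m.start());
        String post = text.substring(m.end(), Math.min(text.length(), m.end() + x));
        return pre + "|" + m.group() + "|" + post;
    }

    public static void main(String[] args) {
        System.out.println(viaGroups("c?at", "cat", 5));    // c|at|
        System.out.println(viaSubstring("c?at", "cat", 5)); // |cat|
    }
}
```

On the cost question: in current JDKs, substring copies just the requested characters, so each call should cost roughly O(x) rather than a scan of the whole text.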

Related

Remove a substring only if the string ends with it, in Java

I have to remove "OR" from a given string, but only if the string ends with it.
public class StringReplaceTest {
public static void main(String[] args) {
String text = "SELECT count OR %' OR";
System.out.println("matches:" + text.matches("OR$"));
Pattern pattern = Pattern.compile("OR$");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println("Found match at: " + matcher.start() + " to " + matcher.end());
System.out.println("substring:" + text.substring(matcher.start(), matcher.end()));
text = text.replace(text.substring(matcher.start(), matcher.end()), "");
System.out.println("after replace:" + text);
}
}
}
Output:
matches:false
Found match at: 19 to 21
substring:OR
after replace:SELECT count %'
It's removing all occurrences of the string "OR", but I only want to remove it if the string ends with it.
How do I do that?
Also, the regex works with Pattern but not with String.matches().
What is the difference between the two, and what is the best way to remove a string only if the text ends with it?
Use text.matches(".*OR$"), because String.matches() has to match the entire string.
Or:
if (text.endsWith("OR"))
Or:
text = text.replaceFirst(" OR$", "");
If you just need to remove the last OR, then I suggest using the substring method, as it is faster than a full regex pattern. In that case, you can remove the OR using this code (note that it cuts at the last OR wherever it occurs, so check text.endsWith("OR") first):
text = text.substring(0, text.lastIndexOf("OR"));
If you need to replace the OR with something else, use this code, which matches the final OR only when it is preceded by a word boundary:
text = text.replaceFirst("\\bOR$", "SOME");
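The difference between matches() and find(), and a regex-free way to drop a trailing suffix, can be sketched as follows. TrailingSuffix and stripSuffix are hypothetical names chosen for this example; only String.endsWith, String.substring, String.matches and Matcher.find are standard API.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class TrailingSuffix {

    // Removes suffix from text only when text actually ends with it.
    static String stripSuffix(String text, String suffix) {
        if (text.endsWith(suffix)) {
            return text.substring(0, text.length() - suffix.length());
        }
        return text;
    }

    public static void main(String[] args) {
        String text = "SELECT count OR %' OR";
        // matches() must cover the whole input; find() only needs a matching substring.
        System.out.println(text.matches("OR$"));                         // false
        System.out.println(Pattern.compile("OR$").matcher(text).find()); // true
        System.out.println("[" + stripSuffix(text, "OR") + "]");         // [SELECT count OR %' ]
    }
}
```

Unlike String.replace, this touches only the end of the string, so the earlier "OR" is left alone.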

Count regex matches with streams

I am trying to count the number of matches of a regex pattern with a simple Java 8 lambdas/streams based solution. For example for this pattern/matcher :
final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");
There is the method splitAsStream which splits the text on the given pattern instead of matching the pattern. Although it's elegant and preserves immutability, it's not always correct :
// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();
// count is 0, wrong
final long count = pattern.splitAsStream("1").count();
I also tried (ab)using an IntStream. The problem is I have to guess how many times I should call matcher.find() instead of until it returns false.
final long count = IntStream
.iterate(0, i -> matcher.find() ? 1 : 0)
.limit(100)
.sum();
I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do that with Java 8 lambdas/streams ?
To use Pattern::splitAsStream properly you have to invert your regex. That means instead of \\d+ (which splits on every number) you should use \\D+. This gives you every number in your String.
final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
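One caveat with the inverted pattern, as far as I can tell: when the input starts with a non-digit, splitAsStream emits an empty leading element, which inflates the count. A minimal sketch that filters empty pieces out (SplitCount and countNumbers are hypothetical names for this example):

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class SplitCount {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D+");

    // Counts digit runs by splitting on non-digits and dropping empty pieces.
    static long countNumbers(String input) {
        return NON_DIGITS.splitAsStream(input)
                .filter(piece -> !piece.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        System.out.println(NON_DIGITS.splitAsStream("a1b2").count()); // 3, includes a leading ""
        System.out.println(countNumbers("a1b2"));                     // 2
        System.out.println(countNumbers("1,2,3,4"));                  // 4
    }
}
```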
The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.
The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.
If you print out all of the matches of 1,2,3,4 you may be surprised to notice that it is actually returning the commas, not the numbers.
System.out.println("[" + pattern.splitAsStream("1,2,3,4")
.collect(Collectors.joining("!")) + "]");
prints [!,!,!,]. The odd bit is why it is giving you 4 and not 3.
Obviously this also explains why "1" gives 0 because there are no strings between numbers in the string.
A quick demo:
private void test(Pattern pattern, String s) {
System.out.println(s + "-[" + pattern.splitAsStream(s)
.collect(Collectors.joining("!")) + "]");
}
public void test() {
final Pattern pattern = Pattern.compile("\\d+");
test(pattern, "1,2,3,4");
test(pattern, "a1b2c3d4e");
test(pattern, "1");
}
prints
1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]
You can extend AbstractSpliterator to solve this:
static class SpliterMatcher extends AbstractSpliterator<Integer> {
private final Matcher m;
public SpliterMatcher(Matcher m) {
super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
this.m = m;
}
@Override
public boolean tryAdvance(Consumer<? super Integer> action) {
boolean found = m.find();
if (found)
action.accept(m.groupCount());
return found;
}
}
final Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1
matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4
matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0
In short: you have a stream of Strings and a String pattern; how many of those strings match the pattern?
final String myString = "1,2,3,4";
Long count = Arrays.stream(myString.split(","))
.filter(str -> str.matches("\\d+"))
.count();
Here the first line could be any other source of a Stream<String>, such as List<String>.stream().
Am I wrong ?
Java 9
You may use Matcher#results() to get hold of all matches:
Stream<MatchResult>    results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
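Counting with Matcher#results() might look like the sketch below. It assumes Java 9 or later; ResultsCount and countMatches are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; requires Java 9+, where Matcher.results() was added.
class ResultsCount {

    // Streams every match of pattern in input and counts them.
    static long countMatches(Pattern pattern, CharSequence input) {
        return pattern.matcher(input).results().count();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(countMatches(digits, "1,2,3,4")); // 4
        System.out.println(countMatches(digits, "1"));       // 1
        System.out.println(countMatches(digits, "foobar"));  // 0
    }
}
```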
Java 8 and lower
Another simple solution based on using a reverse pattern:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
Here, all non-digits are removed from the start and end of a string, and then the string is split by non-digit sequences without reporting any empty trailing elements (since 0 is passed as a limit argument to split).
See this demo:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2

regex seems to be off for special characters (e.g. +-.,!@#$%^&*;)

I am using a regex to print out a string, adding a new line after a character limit. I don't want to split a word that hits the limit (the word should start on the next line) unless a run of concatenated characters exceeds the limit, in which case I just continue the rest of the word on the next line. However, when I hit special characters (e.g. +-.,!@#$%^&*;), as you'll see when I test my code below, an additional character is added to the limit for some reason. Why is this?
My function is:
public static String limiter(String str, int lim) {
str = str.trim().replaceAll(" +", " ");
str = str.replaceAll("\n +", "\n");
Matcher mtr = Pattern.compile("(.{1," + lim + "}(\\W|$))|(.{0," + lim + "})").matcher(str);
String newStr = "";
int ctr = 0;
while (mtr.find()) {
if (ctr == 0) {
newStr += (mtr.group());
ctr++;
} else {
newStr += ("\n") + (mtr.group());
}
}
return newStr ;
}
So my input is:
String str = " The 123456789 456789 +-.,!@#$%^&*();\\/|<>\"\' fox jumpeded over the uf\n 2 3456 green fence ";
With a character line limit of 7.
It outputs:
456789 +
-.,!@#$%
^&*();\/
|<>"
When the correct output should be:
456789
+-.,!@#
$%^&*()
;\/|<>"
My code is linked to an online compiler you can run here:
https://ideone.com/9gckP1
You need to replace the (\W|$) with \b as your intention is to match whole words (and \b provides this functionality). Also, since you do not need trailing whitespace on newly created lines, you need to also use \s*.
So, use
Matcher mtr = Pattern.compile("(?U)(.{1," + lim + "}\\b\\s*)|(.{0," + lim + "})").matcher(str);
See demo
Note that (?U) is used here to "fix" the word boundary behavior to keep it in sync with \w (so that diacritics were not considered word characters).
In your pattern, \\W is part of the first capturing group. It is adding this one (non-word) character to the .{1,limit} pattern.
Try with: "(.{1," + lim + "})(\\W|$)|(.{0," + lim + "})"
(I can't currently use your regex online compiler)
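The extra character can be made visible by measuring the first match. The sketch below is only an illustration: LimitDemo, firstMatchLength and the sample input are made up for this example, and the lookahead variant is an alternative I am assuming for comparison, not something proposed in the answers above and not a drop-in replacement for the limiter function.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and the sample input are illustrative only.
class LimitDemo {

    // Returns the length of the first match of regex in input, or -1 if there is none.
    static int firstMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group().length() : -1;
    }

    public static void main(String[] args) {
        int lim = 7;
        String symbols = "+-.,!?:;=~"; // a run of non-word characters

        // \W is consumed as part of the match, so the line gets lim + 1 characters.
        System.out.println(firstMatchLength("(.{1," + lim + "}(\\W|$))", symbols)); // 8

        // A lookahead only checks for \W without consuming it, so the limit holds.
        System.out.println(firstMatchLength(".{1," + lim + "}(?=\\W|$)", symbols)); // 7
    }
}
```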

Get the last index of a letter followed by numeric

I'm trying to parse a URL and I'd like to test for the last index of a couple characters followed by a numeric value.
Example
used-cell-phone-albany-m3359_l12201
I'm trying to determine if the last "-m" is followed by a numeric value.
So something like this, "used-cell-phone-albany-m3359_l12201".contains("m" followed by numeric)
I'm assuming it needs to be done with regular expressions, but I'm not for sure.
You could use a pattern like [a-z]\\d, which matches any letter between a and z that is immediately followed by a digit; you can specify other characters within the character class if you wish...
Pattern pattern = Pattern.compile("[a-z]\\d", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("used-cell-phone-albany-m3359_l12201");
while (matcher.find()) {
int startIndex = matcher.start();
int endIndex = matcher.end();
String match = matcher.group();
System.out.println(startIndex + "-" + endIndex + " = " + match);
}
The problem is, your test String actually contains two matches m3 and l1
The above example will display
23-25 = m3
29-31 = l1
Updated with feedback
If you can guarantee the marker (i.e. -m), then it becomes a lot simpler...
Pattern pattern = Pattern.compile("-m\\d", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher("used-cell-phone-albany-m3359_l12201");
if (matcher.find()) {
int startIndex = matcher.start();
int endIndex = matcher.end();
String match = matcher.group();
System.out.println(startIndex + "-" + endIndex + " = " + match);
}
In Java, convert the URL to a String if necessary and then run
URLString.matches("^.*m[0-9]+$").
Only if that returns true does the URL end with "m" followed by a number. That can be refined with a more precise ending pattern. The regex tests the pattern at the end of the string because $ matches the end of the string; "[0-9]+" matches a sequence of one or more digits; "^" matches the beginning of the string; and ".*" matches zero or more arbitrary characters (any character except line terminators), including white space, letters, numbers and punctuation marks.
To determine whether the last "m" is followed by a number, use
URLString.matches("^.*m[0-9][^m]*$")
Here the greedy ".*" matches everything up to an "m", and "[^m]*" ensures that no further "m" follows, so the "m" being tested is the last one in the string.
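If a regex is not required, the same check can be sketched with lastIndexOf and Character.isDigit. LastMarker and lastMarkerFollowedByDigit are hypothetical names for this example; note that Character.isDigit also accepts non-ASCII Unicode digits.

```java
// Hypothetical helper class; the names are illustrative only.
class LastMarker {

    // True if the last occurrence of marker exists and is directly followed by a digit.
    static boolean lastMarkerFollowedByDigit(String s, String marker) {
        int i = s.lastIndexOf(marker);
        if (i < 0) {
            return false; // marker does not occur at all
        }
        int next = i + marker.length();
        return next < s.length() && Character.isDigit(s.charAt(next));
    }

    public static void main(String[] args) {
        System.out.println(lastMarkerFollowedByDigit("used-cell-phone-albany-m3359_l12201", "-m")); // true
        System.out.println(lastMarkerFollowedByDigit("a-m1-mx", "-m"));                             // false
    }
}
```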

Why doesn't this code work properly?

Why doesn't the following code work properly?
String keyword = "pattern";
String text = "sometextpatternsometext";
String patternStr = "^.*" + keyword + ".*$"; //
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
System.out.println("start = " + start + ", end = " + end);
}
start = 0, end = 23
This is not the result I expected.
But this code:
String keyword = "pattern";
String text = "sometext pattern sometext";
String patternStr = "\\b" + keyword + "\\b"; //
Pattern pattern = Pattern.compile(patternStr, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
System.out.println("start = " + start + ", end = " + end);
}
start = 9, end = 16
works fine.
It does work. Your pattern
^.*pattern.*$
says to match:
start at the beginning
accept any number of characters
followed by the string pattern
followed by any number of characters
until the end of the string
The result is the entire input string. If you wanted to find only the word pattern, then the regex would be just the word by itself, or as you found, bracketed with word-boundary metacharacters.
It is not that the first example didn't work, it is that you inadvertently asked it to match more than you meant.
The .* expressions expand to contain all the characters before "pattern" and all the characters after pattern, so the whole expression matches the whole line.
With your second example, the \b anchors are zero-width word boundaries, so the expression matches only the word pattern itself (start = 9, end = 16), not the surrounding spaces.
The problem is in your regex: "^.*" + keyword + ".*$"
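The expression .* is greedy: it first consumes as many characters as possible (the whole string) and then backtracks until your keyword can match. Combined with ^ and .*$, the overall match therefore always spans the whole string.
To make the leading part consume as little as possible, make it reluctant (non-greedy), i.e. add a question mark after .*:
"^.*?" + keyword + ".*$"
This time .*? matches the minimum number of characters before your keyword, although the overall match still covers the whole string because of the trailing .*$.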
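To see what greedy versus reluctant actually changes, the keyword can be wrapped in a capturing group and its position read with start(1) and end(1). This is a sketch under my own assumptions: KeywordSpan and spans are hypothetical names, and Pattern.quote is added so that the keyword is treated literally.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class KeywordSpan {

    // Returns {matchStart, matchEnd, keywordStart, keywordEnd}, or null if nothing matches.
    static int[] spans(String keyword, String text, boolean reluctant) {
        String prefix = reluctant ? "^.*?" : "^.*";
        Pattern pattern = Pattern.compile(
                prefix + "(" + Pattern.quote(keyword) + ").*$", Pattern.CASE_INSENSITIVE);
        Matcher m = pattern.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(), m.end(), m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        // The whole match always covers the input; only group 1 locates the keyword.
        System.out.println(Arrays.toString(spans("pattern", "sometextpatternsometext", false))); // [0, 23, 8, 15]
        // With two occurrences, greedy finds the last one and reluctant the first.
        System.out.println(Arrays.toString(spans("pattern", "patternXpattern", false)));         // [0, 15, 8, 15]
        System.out.println(Arrays.toString(spans("pattern", "patternXpattern", true)));          // [0, 15, 0, 7]
    }
}
```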
