Delete regex matches placed inside other regex matches - java

I have two regexes. I want to delete all matches of the second one if they are placed inside matches of the first one. Basically, nothing can be matched in what was already matched. Example:
First regex (bold) - c\w+ finds words beginning with c
Second regex (underlined) - me finds me
Result: cam̲e̲l crim̲e̲ care cool m̲e̲dium m̲e̲lt hom̲e̲
The me in c-words are matched too. What I want is: camel crime care cool m̲e̲dium m̲e̲lt hom̲e̲
Two results of second regex are in results of first regex, I want to delete them, or just don't match them at all. Here's what I tried:
String text = "camel crime care cool medium melt home";
static final Pattern PATTERN_FIRST = Pattern.compile("c\\w+");
static final Pattern PATTERN_SECOND = Pattern.compile("me");
// Save all matches
List<int[]> firstRegexMatches = new ArrayList<>();
for (Matcher m = PATTERN_FIRST.matcher(text); m.find();) {
firstRegexMatches.add(new int[]{m.start(), m.end()});
}
List<int[]> secondRegexMatches = new ArrayList<>();
for (Matcher m = PATTERN_SECOND.matcher(text); m.find();) {
secondRegexMatches.add(new int[]{m.start(), m.end()});
}
// Remove matches of second inside matches of first
for (int[] pos : firstRegexMatches) {
Iterables.removeIf(secondRegexMatches, p -> p[0] > pos[0] && p[1] < pos[1]);
}
In this code I store all matches of both regexes in lists, then try to remove from the second list the matches placed inside matches from the first list.
Not only does this not work, but I'm not sure it's very efficient. Note that this is a simplified version of my situation, which contains more regexes and a large text. Iterables is from Guava.

First of all, you can achieve something like this by merging both expressions into one.
(^c\w+)|\s(c\w+)|(\w*me\w*)
If you match against this regex every match will be either a word starting with "c" followed by some word-characters or a word containing "me". For every match you then either get the group:
(1) or (2) indicating a word starting with "c" or
(3) indicating a word containing "me"
However note that this only works in case you know the delimiter of the words, in this case a \s character.
Example code:
String text = "camel crime care cool medium melt home";
final Pattern PATTERN = Pattern.compile("(^c\\w+)|\\s(c\\w+)|(\\w*me\\w*)");
// Save all matches
List<String> wordsStartingWithC = new ArrayList<>();
List<String> wordsIncludingMe = new ArrayList<>();
for (Matcher m = PATTERN.matcher(text); m.find();) {
if(m.group(1) != null) {
wordsStartingWithC.add(m.group(1));
} else if(m.group(2) != null) {
wordsStartingWithC.add(m.group(2));
} else if(m.group(3) != null) {
wordsIncludingMe.add(m.group(3));
}
}
System.out.println(wordsStartingWithC);
System.out.println(wordsIncludingMe);
I'd recommend simplifying this by taking a somewhat different approach.
As you seem to know the word delimiter, namely the whitespace character, you can get a collection of all words simply by splitting the original string.
String[] words = "camel crime care cool medium melt home".split(" ");
You then simply iterate over all of these.
for(String word: words) {
if(word.startsWith("c")) {
// put in your list for words starting with "c"
} else if (word.contains("me")) {
// put in your list for words containing "me"
}
}
This will result in two lists without duplicate entries, as the second if statement will only be executed in case the first one fails.
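If the match positions themselves are needed, the index-based idea from the question can also be made to work. The following is only a sketch under that assumption: it compares the bounds inclusively (`>=` and `<=`), so that a `me` ending exactly where a c-word ends, as in `crime`, is filtered out too. The class and method names are made up for illustration, and plain Java streams stand in for Guava's `Iterables`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedMatchFilter {

    /** Returns {start, end} spans of inner matches that do not lie inside any outer match. */
    public static List<int[]> matchesOutside(Pattern outer, Pattern inner, String text) {
        // Collect the spans of the outer pattern first.
        List<int[]> outerSpans = new ArrayList<>();
        for (Matcher m = outer.matcher(text); m.find();) {
            outerSpans.add(new int[] {m.start(), m.end()});
        }
        // Keep an inner match only if no outer span contains it (bounds inclusive).
        List<int[]> kept = new ArrayList<>();
        for (Matcher m = inner.matcher(text); m.find();) {
            final int start = m.start();
            final int end = m.end();
            boolean contained = outerSpans.stream()
                    .anyMatch(span -> start >= span[0] && end <= span[1]);
            if (!contained) {
                kept.add(new int[] {start, end});
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "camel crime care cool medium melt home";
        List<int[]> kept = matchesOutside(Pattern.compile("c\\w+"), Pattern.compile("me"), text);
        for (int[] span : kept) {
            System.out.println(span[0] + "-" + span[1]); // prints 22-24, 29-31, 36-38
        }
    }
}
```

This scans the outer spans once per inner match, which is fine for a sketch; with many patterns and a large text, sorting the spans and walking both lists in order would avoid the nested scan.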

Isn't it possible to combine the two Regexes? For example, the me after c can be found using one Regex with this code:
((?<=c)|(?<=c\w)|(?<=c\w{2})|(?<=c\w{3})|(?<=c\w{4})|(?<=c\w{5}))me
Check it out here: https://regex101.com/r/bfNkvF/2
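The same lookbehind idea can be inverted to find only the `me` occurrences that are not inside a c-word. This is a hedged sketch rather than a general solution: Java needs to be able to bound the length of a lookbehind, so it assumes that at most 20 word characters separate the leading `c` from the `me`. That limit, and the class and method names, are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindFilter {

    // Negative lookbehind: reject "me" when it is preceded by a word that starts with "c".
    // Assumption: the part of the c-word before "me" is at most 21 characters long.
    private static final Pattern ME_OUTSIDE_C_WORDS =
            Pattern.compile("(?<!\\bc\\w{0,20})me");

    /** Returns the start index of every "me" that is not part of a c-word. */
    public static List<Integer> starts(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = ME_OUTSIDE_C_WORDS.matcher(text);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(starts("camel crime care cool medium melt home")); // [22, 29, 36]
    }
}
```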

Related

Java Splitting Strings with several conditions

I want to split a string along several different conditions -
I understand there is a Java String method called String.split(element), which splits the String into an array based on the element specified.
However, splitting among more objects seems to be very complex -- especially if the split must occur to a range of elements.
Precisely, I want java to split the string
"a>=b" into {"a",">=","b"}
"a>b" into {"a", ">", "b"}
"a==b" into {"a","==","b"}
I have been fiddling around with regex too, just to see how to split it exactly based on these parameters, but the closest I've gotten is splitting along a single character.
EDIT: a and b are arbitrary Strings that can be of any length. I simply want to split along the different kinds of comparators ">",">=","==";
For example, a could be "Apple" and b could be "Orange".
So in the end I want the String from "Apple>=Orange" into
{"Apple", ">=", "Orange"}
You can use regular expressions. No matter if you use a, or b or abc for your variables you'll get the first variable in the group 1, the condition in the group 2 and the second variable in the group 3.
Pattern pattern = Pattern.compile("(\\w+)([<=>]+)(\\w+)");
Matcher matcher = pattern.matcher("var1>=ar2b");
if(matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
System.out.println(matcher.group(3));
}
The following code works for your examples:
System.out.println(Arrays.asList("a<=b".split("\\b")));
It splits the string on word boundaries.
If you need more elaborate splitting, you have to provide more examples.
You could code it out by hand and use whichever tokens you want to split on like so
public String[] splitString(String word) {
    // Two-character tokens come first so ">=" is not mistaken for ">".
    String[] tokens = {"==", ">=", "<=", "<", ">"};
    for (String token : tokens) {
        int index = word.indexOf(token);
        if (index >= 0) {
            return new String[] {
                word.substring(0, index),
                token,
                word.substring(index + token.length())
            };
        }
    }
    // No token found: return the input unchanged as a single element.
    return new String[] {word};
}
This will return an array with whatever is before the token found, the token itself and whatever is left.
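A regex variant of the same idea might look like the sketch below. It lists the comparators explicitly, longest first, and lets the operands be arbitrary text rather than only word characters. The class name, method name and the choice to throw on input without a comparator are assumptions for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ComparisonSplitter {

    // Longer operators are listed first so ">=" is tried before ">".
    // The first operand is matched lazily so the split happens at the first comparator.
    private static final Pattern COMPARISON =
            Pattern.compile("(.+?)(>=|<=|==|>|<)(.+)");

    /** Splits "left OP right" into {left, OP, right}. */
    public static String[] split(String expression) {
        Matcher m = COMPARISON.matcher(expression);
        if (!m.matches()) {
            throw new IllegalArgumentException("No comparator found in: " + expression);
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("Apple>=Orange"))); // [Apple, >=, Orange]
        System.out.println(Arrays.toString(split("a>b")));           // [a, >, b]
        System.out.println(Arrays.toString(split("a==b")));          // [a, ==, b]
    }
}
```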

Regex - Words containing multiple underscores

I'm making a Lexer, and have chosen to use Regex to split my tokens.
I'm working on all different tokens, except the one that really bugs me is words and identifiers.
You see, the rules I have in place are the following:
Words cannot start with or end with an underscore.
Words can be one or more characters in length.
Underscores can only be used between letters, and can appear multiple times.
Example of what I want:
_foo <- Invalid.
foo_ <- Invalid.
_foo_ <- Invalid.
foo_foo <- Valid.
foo_foo_foo <- Valid.
foo_foo_ <- Partially Valid. Only "foo_foo" should be picked up.
_foo_foo <- Partially Valid. Only "foo_foo" should be picked up.
I'm getting close, as this is what I currently have:
([a-zA-Z]+_[a-zA-Z]+|[a-zA-Z]+)
Except, it only detects the first occurrence of an underscore. I want all of them.
Personal Request:
I would rather the answer be contained inside of a single group, as I have structured my tokeniser around them, except I would be more than happy to change my design if you can think of a better way of handling it. This is what I currently use:
private void tokenise(String regex, String[] data) {
Set<String> tokens = new LinkedHashSet<String>();
Pattern pattern = Pattern.compile(regex);
// First pass. Uses regular expressions to split data and catalog token types.
for (String line : data) {
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
for (int i = 1; i < matcher.groupCount() + 1; i++) {
if (matcher.group(i) != null) {
switch(i) {
case (1):
// Example group.
// Normally I would structure like:
// 0: Identifiers
// 1: Strings
// 2-?: So on so forth.
tokens.add("FOO:" + matcher.group());
break;
}
}
}
}
}
}
Try ([a-zA-Z]+(?:_[a-zA-Z]+)*)
The first part of the pattern, [a-zA-Z]+, matches one or more letters.
The second part of the pattern, (?:_[a-zA-Z]+), matches an underscore if it is followed by one or more letters.
The * at the end means the second part can be repeated zero or more times.
The (?: ) is a non-capturing group: it groups like plain (), but doesn't return the matched group.
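A small sketch of how this pattern behaves with find(), using inputs from the question. The class and method names are invented for the example. Note that find() also picks up foo from an input like _foo, so rejecting such input outright would need an extra check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFinder {

    // Letters, optionally followed by any number of "_letters" segments, in one group.
    private static final Pattern WORD = Pattern.compile("([a-zA-Z]+(?:_[a-zA-Z]+)*)");

    /** Returns every word found in the line, in order of appearance. */
    public static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        // Leading and trailing underscores are left out of the matches.
        System.out.println(findWords("foo_foo_ _foo_foo foo_foo_foo"));
        // [foo_foo, foo_foo, foo_foo_foo]
    }
}
```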

Matching and sorting a Bukkit ChatColor expression

I'm splitting up a String by spaces and then checking each piece if it contains a code (&a, &l, etc). If it matches, I have to grab the codes that are beside each other and then order them alphanumerically (0, 1, 2... a, b, c...).
Here is what I tried so far:
String message = "&l&aCheckpoint&m&6doreime";
String[] parts = message.split(" "); // This may not be needed for the example, but I'm only using one word for simplicity here
List<String> orderedMessage = new ArrayList<>();
Pattern pattern = Pattern.compile("((?:&|\u00a7)[0-9A-FK-ORa-fk-or])(.*?)"); // Completely matches the entire pattern, not what i want
for (String part : parts) {
if (pattern.matcher(part).matches()) {
List<String> orderedParts = new ArrayList<>();
// what do i do?
}
}
I need to change the pattern value so it matches groups like this:
Match: &l&aCheckpoint
Groups that I need: [&l, &a, Checkpoint]
Match: &m&6doreime
Groups that I need: [&m, &6, doreime]
How can I match each shown Match and split it into the 3 groups (where it splits each code section (&[0-9A-FK-ORa-fk-or]) and the remaining text until another code section)?
Info: For anyone who is wondering why, when you submit color/format coded text to Minecraft, colors have to come first, or the format ([a-fk-or]) codes are ignored because of how Minecraft has parsed color codes since 1.5. By sorting them and rebuilding the message, it won't rely on users or developers getting the order correct.
You can get what you are after by using a slightly more complicated regex
(((?:&|§)[0-9A-FK-ORa-fk-or])+)([^&]*)
Breaking it down we have two important capturing groups
(((?:&|§)[0-9A-FK-ORa-fk-or])+)
This captures one or more code sections, each consisting of an & (or §) followed by a code character.
([^&]*)
The second grabs any number of non-& characters, which gets you the remainder of that section. (This is slightly different behavior from the regex you provided; things get more complicated if & is a legal character in the string.)
Putting that regex into use with a Matcher you can do the following,
String input = "&l&aCheckpoint&m&6doreime";
Pattern pattern = Pattern.compile("(((?:&|§)[0-9A-FK-ORa-fk-or])+)([^&]*)");
Matcher patternMatcher = pattern.matcher(input);
while(patternMatcher.find()){
String[] codes = patternMatcher.group(1).split("(?=&)");
String rest = patternMatcher.group(3);
}
Which will loop twice, giving you
codes = ["&l", "&a"]
rest = "Checkpoint"
on the first loop and the following on the second
codes = ["&m", "&6"]
rest = "doreime"
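To go from these groups to a rebuilt message, the codes in each section can be sorted and written back. The sketch below is one possible way to do that, not a definitive implementation. It assumes that ordering by the lower-cased code character is enough (digits and a-f sort before the format codes k-o and r), it treats § as a code prefix in both the pattern and the split (an extension of the regex above), and its class and method names are made up.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColorCodeSorter {

    // Group 1: a run of codes; group 2: the text up to the next code prefix.
    private static final Pattern SECTION =
            Pattern.compile("((?:[&\u00a7][0-9A-FK-ORa-fk-or])+)([^&\u00a7]*)");

    /** Sorts the codes inside each run so that colors come before format codes. */
    public static String sortCodes(String message) {
        StringBuilder out = new StringBuilder();
        Matcher m = SECTION.matcher(message);
        int last = 0;
        while (m.find()) {
            // Copy any text between the previous section and this one unchanged.
            out.append(message, last, m.start());
            // Split the run in front of every prefix character, e.g. "&l&a" -> ["&l", "&a"].
            String[] codes = m.group(1).split("(?=[&\u00a7])");
            Arrays.sort(codes,
                    Comparator.comparing((String c) -> Character.toLowerCase(c.charAt(1))));
            for (String code : codes) {
                out.append(code);
            }
            out.append(m.group(2));
            last = m.end();
        }
        out.append(message.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortCodes("&l&aCheckpoint&m&6doreime"));
        // &a&lCheckpoint&6&mdoreime
    }
}
```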

Multiple matches with delimiter

this is my regex:
([+-]*)(\\d+)\\s*([a-zA-Z]+)
group no.1 = sign
group no.2 = multiplier
group no.3 = time unit
The thing is, I would like to match the given input, but it can be "chained". So my input should be valid if and only if the whole pattern repeats without anything between the occurrences (except for whitespace). (Only one match, or multiple matches next to each other with possible whitespace between them.)
valid examples:
1day
+1day
-1 day
+1day-1month
+1day +1month
+1day +1month
invalid examples:
###+1day+1month
+1day###+1month
+1day+1month###
###+1day+1month###
###+1day+1month###
In my case I can use the matcher.find() method; this would do the trick, but it will accept input like +1day###+1month, which is not valid for me.
Any ideas? This can be solved with multiple IF conditions and multiple checks for start and end indexes but I'm searching for elegant solution.
EDIT
The suggested regex in comments below ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ will partially do the trick but if I use it in the code below it returns different result than the result I'm looking for.
The problem is that I cannot use (*my regex*)+ because it will match the whole thing.
The solution could be to match the whole input with ^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$ and then use ([+-]*)(\\d+)\\s*([a-zA-Z]+) with matcher.find() and matcher.group(i) to extract each match and its groups. But I was looking for a more elegant solution.
This should work for you:
^\s*(([+-]*)(\d+)\s*([a-zA-Z]+)\s*)+$
First, by adding the beginning and ending anchors (^ and $), the pattern will not allow invalid characters to occur anywhere before or after the match.
Next, I included optional whitespace before and after the repeated pattern (\s*).
Finally, the entire pattern is enclosed in a repeater so that it can occur multiple times in a row ((...)+).
On a side note, I'd also recommend changing [+-]* to [+-]? so that it can only occur once.
You could use ^$ for that, to match the start/end of string
^\s*(?:([+-]?)(\d+)\s*([a-z]+)\s*)+$
https://regex101.com/r/lM7dZ9/2
See the Unit Tests for your examples. Basically, you just need to allow the pattern to repeat and force that nothing besides whitespace occurs in between the matches.
Combined with line start/end matching and you're done.
You can use String.matches or Matcher.matches in Java to match the entire region.
Java Example:
public class RegTest {
public static final Pattern PATTERN = Pattern.compile(
"(\\s*([+-]?)(\\d+)\\s*([a-zA-Z]+)\\s*)+");
@Test
public void testDays() throws Exception {
assertTrue(valid("1 day"));
assertTrue(valid("-1 day"));
assertTrue(valid("+1day-1month"));
assertTrue(valid("+1day -1month"));
assertTrue(valid(" +1day +1month "));
assertFalse(valid("+1day###+1month"));
assertFalse(valid(""));
assertFalse(valid("++1day-1month"));
}
private static boolean valid(String s) {
return PATTERN.matcher(s).matches();
}
}
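The two-step approach mentioned in the question's edit (validate the whole input first, then extract each term with find()) could be sketched roughly as follows. The class name, method name and the use of IllegalArgumentException are assumptions for the example; the patterns follow the ones above, with [+-]? for the sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DurationParser {

    // One term: optional sign, multiplier, time unit.
    private static final Pattern TERM =
            Pattern.compile("([+-]?)(\\d+)\\s*([a-zA-Z]+)");

    // The whole input: one or more terms separated only by optional whitespace.
    private static final Pattern WHOLE =
            Pattern.compile("\\s*(?:[+-]?\\d+\\s*[a-zA-Z]+\\s*)+");

    /** Returns one {sign, multiplier, unit} array per term, or throws if the input is invalid. */
    public static List<String[]> parse(String input) {
        // Step 1: matches() must cover the entire input, so "###" anywhere is rejected.
        if (!WHOLE.matcher(input).matches()) {
            throw new IllegalArgumentException("Invalid format: " + input);
        }
        // Step 2: the input is known to be valid, so find() can walk the terms safely.
        List<String[]> terms = new ArrayList<>();
        Matcher m = TERM.matcher(input);
        while (m.find()) {
            terms.add(new String[] {m.group(1), m.group(2), m.group(3)});
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String[] term : parse("+1day -1month")) {
            System.out.println(String.join(" | ", term));
        }
    }
}
```

The input is scanned twice, but each pattern stays short and readable, and the capture groups of every term remain available.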
You can proceed like this:
String p = "\\G\\s*(?:([-+]?)(\\d+)\\s*([a-z]+)|\\z)";
Pattern RegexCompile = Pattern.compile(p, Pattern.CASE_INSENSITIVE);
String s = "+1day 1month";
ArrayList<HashMap<String, String>> results = new ArrayList<HashMap<String, String>>();
Matcher m = RegexCompile.matcher(s);
boolean validFormat = false;
while( m.find() ) {
if (m.group(1) == null) {
// if the capture group 1 (or 2 or 3) is null, it means that the second
// branch of the pattern has succeeded (the \z branch) and that the end
// of the string has been reached.
validFormat = true;
} else {
// otherwise, this is not the end of the string and the match result is
// temporarily stored in the ArrayList 'results'
HashMap<String, String> result = new HashMap<String, String>();
result.put("sign", m.group(1));
result.put("multiplier", m.group(2));
result.put("time_unit", m.group(3));
results.add(result);
}
}
if (validFormat) {
for (HashMap<String, String> item : results) {
System.out.println("sign: " + item.get("sign")
+ "\nmultiplier: " + item.get("multiplier")
+ "\ntime_unit: " + item.get("time_unit") + "\n");
}
} else {
results.clear();
System.out.println("Invalid Format");
}
The \G anchor matches the start of the string or the position after the previous match. In this pattern, it ensures that all matches are contiguous. If the end of the string is reached, that proves the string is valid from start to end.

Regex - extract indefinite number of hits

The method getPolygonPoints() (see below) receives a String name as a parameter, which looks something like this:
points={{-100,100},{-120,60},{-80,60},{-100,100},{-100,100}}
The first number stands for the x-coordinate, the second for the y coordinate. For example,the first point is
x=-100
y=100
The second point is
x=-120
y=60
and so on.
Now I want to extract the points from the String and put them in an ArrayList, which should look like this at the end:
[-100, 100, -120, 60, -80, 60, -100, 100, -100, 100]
The special feature here is, that the number of points in the given String changes and is not always the same.
I have written the following code:
private ArrayList<Integer> getPolygonPoints(String name) {
// the regular expression
String regGroup = "[-]?[\\d]{1,3}";
// compile the regular expression into a pattern
Pattern regex = Pattern.compile("\\{(" + regGroup + ")");
// the matcher
Matcher matcher;
ArrayList<Integer> points = new ArrayList<Integer>();
// matcher that will match the given input against the pattern
matcher = regex.matcher(name);
int i = 1;
while(matcher.find()) {
System.out.println(Integer.parseInt(matcher.group(i)));
i++;
}
return points;
}
The first x coordinate is extracted correctly, but then an IndexOutOfBoundsException is thrown. I think that happens because group 2 is not defined.
I think at first I have to count the points and then iterate over this number. Inside of the iteration I would put the int values in the ArrayList with a simple add(). But I don't know how to do this. Maybe I don't understand the regex part at this point. Especially how the groups work.
Please help!
String points = "{{-100,100},{-120,60},{-80,60},{-100,100},{-100,100}}";
String[] strs = points.replaceAll("(\\{|\\})", "").split(",");
ArrayList<Integer> list = new ArrayList<Integer>(strs.length);
for (String s : strs)
{
list.add(Integer.valueOf(s));
}
The part you don't seem to understand about the regex API is that capture group numbers "reset" with every call to find(). Or, to put it another way: the number of a capture group is its position in the pattern, not in the input string.
You're also going about this the wrong way. You should match the whole construct you're looking for, in this case the {x,y} pairs. I'm assuming you don't want to validate the format of the whole string, so we can ignore the outside brackets and comma:
Pattern p = Pattern.compile("\\{(-?\\d+),(-?\\d+)\\}");
Matcher m = p.matcher(name);
while (m.find()) {
String x = m.group(1);
String y = m.group(2);
// parse and add to list
}
Alternately, since you don't care about which coordinate is X and which is Y, you can even do:
Matcher m = Pattern.compile("-?\\d+").matcher(name);
while (m.find()) {
String xOrY = m.group();
// parse etc.
}
Now, if you want to validate the input as well, I'd say that's a separate concern, I wouldn't necessarily try to do it in the same step as the parsing to keep the regex readable. (It might be possible in this case but if you don't need it why bother in the first place.)
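If validation is wanted after all, keeping it as a separate step might look like the sketch below: one pattern checks the overall shape, and the pair pattern from above does the extraction. The exact shape accepted by the validation pattern (a word, an equals sign, then brace-wrapped pairs without whitespace) is an assumption based on the sample input, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PolygonPoints {

    // Extraction: one {x,y} pair per match; group 1 is x, group 2 is y.
    private static final Pattern PAIR = Pattern.compile("\\{(-?\\d+),(-?\\d+)\\}");

    // Validation (assumed format): name={{x,y},{x,y},...}
    private static final Pattern WHOLE = Pattern.compile(
            "\\w+=\\{\\{-?\\d+,-?\\d+\\}(?:,\\{-?\\d+,-?\\d+\\})*\\}");

    /** Returns the coordinates as a flat list: x1, y1, x2, y2, ... */
    public static List<Integer> parse(String name) {
        if (!WHOLE.matcher(name).matches()) {
            throw new IllegalArgumentException("Unexpected format: " + name);
        }
        List<Integer> points = new ArrayList<>();
        Matcher m = PAIR.matcher(name);
        while (m.find()) {
            // Group numbers refer to the pattern, so they are the same on every iteration.
            points.add(Integer.valueOf(m.group(1)));
            points.add(Integer.valueOf(m.group(2)));
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(parse("points={{-100,100},{-120,60},{-80,60},{-100,100},{-100,100}}"));
        // [-100, 100, -120, 60, -80, 60, -100, 100, -100, 100]
    }
}
```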
You can also try this regex:
((-?\d+)\s*,\s*(-?\d+))
It will give you three groups:
Group 1 : x,y
Group 2 : x
Group 3 : y
Use whichever one you need.
How about doing it in just one line:
List<String> list = Arrays.asList(name.replaceAll("(^\\w+=\\{+)|(\\}+$)", "").split("\\}?,\\{?"));
Your whole method, with the strings converted to integers, would then be:
private ArrayList<Integer> getPolygonPoints(String name) {
    String[] parts = name.replaceAll("(^\\w+=\\{+)|(\\}+$)", "").split("\\}?,\\{?");
    return Arrays.stream(parts).map(Integer::valueOf).collect(Collectors.toCollection(ArrayList::new));
}
This works by first stripping off the leading and trailing text, then splitting on commas optionally surrounded by braces.
BTW You really should return the abstract type List, not the concrete implementation ArrayList.
