Regex to find all possible occurrences of text starting and ending with ~ - java

I would like to find all possible occurrences of text enclosed between two ~s.
For example: For the text ~*_abc~xyz~ ~123~, I want the following expressions as matching patterns:
~*_abc~
~xyz~
~123~
Note that the enclosed text can contain letters or digits.
I tried with the regex ~[\w]+?~ but it is not giving me ~xyz~. I want ~ to be reconsidered. But I don't want just ~~ as a possible match.

Use capturing inside a positive lookahead with the following regex:
Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex:
(?=(\w+))
At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string). It also returns what was captured by Group 1: ABCD
The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD.
The engine moves to the next position in the string, and the process repeats itself for CD then D.
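The walkthrough above can be reproduced with a short Java sketch. The class and method names (OverlapDemo, suffixes) are made up for illustration; only the pattern (?=(\w+)) comes from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OverlapDemo {
    // Collects group 1 of every zero-width lookahead match.
    // Each match consumes nothing, so find() retries at the next position.
    static List<String> suffixes(String input) {
        Matcher m = Pattern.compile("(?=(\\w+))").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("ABCD")); // [ABCD, BCD, CD, D]
    }
}
```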
So, use
(?=(~[^\s~]+~))
See the regex demo
The pattern (?=(~[^\s~]+~)) checks each position in the string for a ~ followed by one or more characters other than whitespace and ~, followed by another ~. Because the lookahead consumes no characters, the engine advances by only one position after each check rather than past the captured text, so overlapping substrings get extracted.
Java demo:
String text = " ~*_abc~xyz~ ~123~";
Pattern p = Pattern.compile("(?=(~[^\\s~]+~))");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while (m.find()) {
    res.add(m.group(1));
}
System.out.println(res); // => [~*_abc~, ~xyz~, ~123~]
Just in case someone needs a Python demo:
import re
p = re.compile(r'(?=(~[^\s~]+~))')
test_str = " ~*_abc~xyz~ ~123~"
print(p.findall(test_str))
# => ['~*_abc~', '~xyz~', '~123~']

Try this: [^~\s]*
This pattern matches runs of characters other than ~ and whitespace (written as \s).
I have tested it and it works on your string; here's the demo.

Related

Regular expression - How to match only character after number, without including number in match pattern

I want to determine a regular expression, but came up with only the below
Expression = ^[£]*|(,)|[0-9]p$
Test String = £4dj,gs,dl"34p
Issue: In my test string, "4p" comes up as a match (which is close to my expectation), but I want only "p" to be highlighted/matched, because I want to replace the p and not the number.
Goal: replace p only if it comes after a number, without touching the number
Test String = ship
p in ship is not a match, which is valid for my scenario.
https://regex101.com/
Using ^[£]* matches optional repetitions of £ at the start of the string, and can also match an empty string.
If you only want to match the character to be replaced, you don't need the capture group around the comma.
Your pattern, with the anchors:
^£|,|(?<=[0-9])p$
^£ Match a pound sign at the start of the string
| Or
, Match a comma
| Or
(?<=[0-9])p$ Match a p char at the end of the string asserting a digit to the left
Regex demo
Without the anchors to match all occurrences, you can combine the comma and pound sign in a character class.
[£,]|(?<=[0-9])p
Regex demo
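A minimal Java sketch of the replacement, for illustration only. The question does not say what the matches should be replaced with, so the empty replacement string is an assumption, and the names ReplaceDemo and strip are hypothetical. The pound sign is written as \u00A3 to avoid source-encoding problems.

```java
final class ReplaceDemo {
    // Removes pound signs, commas, and any p that directly follows a digit.
    // The lookbehind (?<=[0-9]) checks the digit without consuming it.
    static String strip(String input) {
        return input.replaceAll("[\u00A3,]|(?<=[0-9])p", "");
    }

    public static void main(String[] args) {
        String sample = "\u00A3" + "4dj,gs,dl\"34p";
        System.out.println(strip(sample)); // 4djgsdl"34
        System.out.println(strip("ship")); // ship (p is not preceded by a digit)
    }
}
```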

Regex to ignore variable initializations that have already been declared

I need a regex that would ignore a token if a part of the token has already been captured before.
Example
var bold, det, bold=6, sum, k
Here, bold=6 should be ignored because bold has already been captured.
Also, var must be present before any matching can take place, the last token k should not be followed by a comma. Only the variables within var and the last token k should be followed by a comma.
Another Example
var bold=6, det, bold, sum, k
Here, bold which follows det should be ignored because bold=6 has already been captured.
I tried using this pattern (?:\\bvar\\b|\\G)\\s*(\\w+)(?:,|$), but it doesn't ignore what has been repeated.
Depending on what information you need, you can try the following.
A solution that works only in Java and gives you the variable name together with its start and end indices:
(?<=var.{0,999})(?<!=)(?!var\b)\b(?<var>\w+)\b(?<!var.{1,999}(?=\k<var>).{1,999}(?=\k<var>).{1,999})
RegexPlanet Demo
It uses an ugly but quite effective feature of Java regex: intervals (x{min,max}) in lookbehind. Java accepts a quantifier inside a lookbehind as long as it has an explicit minimum and maximum length, so instead of .* you can use, for example, .{0,999}. The pattern will fail if more than 999 characters are needed; you can use a bigger number, but I think that is not necessary in this case. The named group <var> is optional here; you can replace it in code with a normal group.
Implementation in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String test = "var bold, det, bold=6, sum, k\n" +
                "var foo=6, abc, foo, xyz, k";
        Matcher matcher = Pattern.compile("(?<=var.{0,999})(?<!=)(?!var\\b)\\b(?<var>\\w+)\\b(?<!var.{1,999}(?=\\k<var>).{1,999}(?=\\k<var>).{1,999})").matcher(test);
        while (matcher.find()) {
            // Print the captured name with its start and end offsets.
            System.out.println(matcher.group("var") + "," + matcher.start("var") + "," + matcher.end("var"));
        }
    }
}
with output (variable name, start index, end index):
bold,4,8
det,10,13
sum,23,26
k,28,29
foo,34,37
abc,41,44
xyz,51,54
k,56,57
Explanation of regex:
(?<=var.{0,999}) - must be preceded by the text var followed by any number of characters (up to 999), but not a newline,
(?<!=) - should not be preceded by an equals sign, to avoid matching a variable name and its value as different matches,
(?!var\b) - the word starting here cannot be var, so the keyword itself is not matched,
\b(?<var>\w+)\b - a separate word, captured into the <var> group,
(?<!var.{1,999}(?=\k<var>).{1,999}(?=\k<var>).{1,999}) - the matched word cannot be preceded by the word var followed by some characters that include the captured word, followed by some characters that include the captured word again,
But as I wrote, it will work only in Java.
If you need just variable names, you can use:
(?<=var\s|\G,\s)(?<var>\w+)(?=,|$)|(?<=var\s|\G,\s)(?<initialized>[^,\n]+)
DEMO
to get variable names without duplicates. But if you want start/end indices, it will also capture the second occurrence of a duplicated variable name into the group.
You can tweak your regex to this with a negative lookahead:
(?:\bvar\b|\G)\s*(?:(\w+)(?!.*\b\1\b)(?:=\w+)?|\S+)(?:,|\bk\b)
RegEx Demo
Rather than keeping track of what it has already matched, it skips a word if the same word appears again later in the string, so the last occurrence of a repeated name is the one that gets captured.
Here (?!.*\b\1\b) is a negative lookahead that prevents matching a word if the same word is found to its right in the input. \1 is a back-reference to the matched word.
RegEx Breakup:
(?:\bvar\b|\G) # match text var or \G
\s* # match 0 or more whitespace characters
(?: # start non-capturing group
(\w+)(?!.*\b\1\b) # match a word only if the same word is not found in the rest of the input
(?:=\w+)? # followed by optional = and some value
| # regex alternation
\S+ # OR match 1 or more non-space character
) # close non-capturing group
(?:,|\bk\b) # match a comma or k
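If keeping the first occurrence matters, which is what the question asks for, one alternative is to keep the regex simple and track seen names in code. The sketch below is a swapped-in technique, not either answer's pattern. It assumes a single-line declaration where each token has the form name or name=value, and the names DeclaredNames and firstOccurrences are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeclaredNames {
    // The declaration must start with the keyword var followed by whitespace.
    private static final Pattern HEAD = Pattern.compile("^\\s*var\\s+(.*)$");
    // One token: a name, optionally followed by =value (assumed to be word characters).
    private static final Pattern TOKEN = Pattern.compile("(\\w+)(?:=\\w+)?");

    // Returns each variable name once, in order of first appearance.
    static List<String> firstOccurrences(String declaration) {
        Matcher head = HEAD.matcher(declaration);
        if (!head.find()) {
            return new ArrayList<>(); // no leading var: nothing is matched
        }
        Set<String> seen = new LinkedHashSet<>(); // ignores repeated names, keeps order
        Matcher token = TOKEN.matcher(head.group(1));
        while (token.find()) {
            seen.add(token.group(1));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(firstOccurrences("var bold, det, bold=6, sum, k")); // [bold, det, sum, k]
        System.out.println(firstOccurrences("var bold=6, det, bold, sum, k")); // [bold, det, sum, k]
    }
}
```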

Replace multiple capture groups using regexp with java

I have this requirement - for an input string such as the one shown below
8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs
I would like to strip the matched word boundaries (where the matching pair is 8 or & or % etc.), which should result in the following:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
This list of characters that is used for the pairs can vary e.g. 8,9,%,# etc and only the words matching the start and end with each type will be stripped of those characters, with the same character embedded in the word remaining where it is.
Using Java I can do a pattern as \\b8([^\\s]*)8\\b and replacement as $1, to capture and replace all occurrences of 8...8, but how do I do this for all the types of pairs?
I can provide a pattern such as \\b8([^\\s]*)8\\b|\\b9([^\\s]*)9\\b .. and so on that will match all types of matching pairs (8,9,..), but how do I specify a 'variable' replacement group -
e.g. if the match is 9...9, then the replacement should be $2.
I can of course run it through multiple of these, each replacing a specific type of pair, but I am wondering if there is a more elegant way.
Or is there a completely different way of approaching this problem?
Thanks.
You could use the below regex and then replace the matched characters by the characters present inside the group index 2.
(?<!\S)(\S)(\S+)\1(?=\s|$)
OR
(?<!\S)(\S)(\S*)\1(?=\s|$)
Java regex would be,
(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)
DEMO
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)", "$2"));
Output:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
Explanation:
(?<!\\S) Negative lookbehind; asserts that the match is not preceded by a non-space character.
(\\S) Captures the first non-space character and stores it in group index 1.
(\\S+) Captures one or more non-space characters into group index 2.
\\1 Refers to the character inside the first captured group.
(?=\\s|$) The match must be followed by a whitespace character or the end-of-line anchor.
This makes sure that the first and last characters of each matched token are the same. If so, the whole match is replaced by the characters captured in group index 2.
For this specific case, you could modify the above regex as,
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)([89&#%])(\\S+)\\1(?=\\s|$)", "$2"));
DEMO
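Since the question says the list of pair characters can vary, the delimiter set can also be built at run time. This is a sketch under that assumption; PairStripper and stripPairs are hypothetical names, and Pattern.quote is used so that delimiters such as % or # are always treated literally.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PairStripper {
    // Builds (?<!\S)(d1|d2|...)(\S+)\1(?=\s|$) from the given delimiters
    // and replaces each delimited token with its inner text (group 2).
    static String stripPairs(String input, List<String> delimiters) {
        String alternation = delimiters.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile("(?<!\\S)(" + alternation + ")(\\S+)\\1(?=\\s|$)");
        return p.matcher(input).replaceAll("$2");
    }

    public static void main(String[] args) {
        String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
        System.out.println(stripPairs(s1, List.of("8", "9", "&", "#", "%")));
        // This is reallly a test of repl%acing %mul%tiple matched 9pairs
    }
}
```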
(?<![a-zA-Z])[8&#%9](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[8&#%9](?![a-zA-Z])
Try this. Replace with $1 or \1. See demo.
https://regex101.com/r/qB0jV1/15
(?<![a-zA-Z])[^a-zA-Z](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[^a-zA-Z](?![a-zA-Z])
Use this if you have many delimiters.

validate string in java

I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said number of commas is your primary matching criteria. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match one or more characters that are anything but a comma, followed by a comma, and repeat that match 6 times consecutively. Note: your data must end with a trailing comma, as your sample data does.
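As a rough illustration, the six-field check could be wrapped like this in Java. The class and method names (CsvLineCheck, isValid) are invented for the example.

```java
import java.util.regex.Pattern;

final class CsvLineCheck {
    // Exactly six non-empty fields, each terminated by a comma.
    private static final Pattern SIX_FIELDS = Pattern.compile("^([^,]+,){6}$");

    static boolean isValid(String line) {
        return SIX_FIELDS.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F,")); // true
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F"));  // false: no trailing comma
        System.out.println(isValid("a,,c,d,e,f,"));                                   // false: empty field
    }
}
```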
Well, your regular expression is certainly garbled - there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape ,s. Let's first describe our requirements; you seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says match one or more non-commas, followed by a comma - [^,]+, - six times - {6}. The (?:...) notation is a non-capturing group, which lets us say match the whole sub-expression six times; without it, the {6} would only apply to the preceding character.
Alternately, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
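// Imports needed by this snippet:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
        "([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if (m.matches()) {
    // Groups 1..6 hold the six comma-separated fields.
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}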
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F

Java - Extract strings with Regex

I've this string
String myString ="A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
and I need to extract these 3 substrings
1234
06:30
07:45
If I use this regex \\d{2}\\:\\d{2} I'm only able to extract the first hour 06:30
Pattern depArrHours = Pattern.compile("\\d{2}\\:\\d{2}");
Matcher matcher = depArrHours.matcher(myString);
String firstHour = matcher.group(0);
String secondHour = matcher.group(1); // IndexOutOfBoundsException: no group 1
matcher.group(1) throws an exception.
Also I don't know how to extract 1234. This string can change but it always comes after 'XX~ '
Do you have any idea on how to match these strings with regex expressions?
UPDATE
Thanks to Adam's suggestion, I now have this regex that matches my string:
Pattern p = Pattern.compile(".*XX~ (\\d{3,4}).*(\\d{1,2}:\\d{2}).*(\\d{1,2}:\\d{2})");
I match the number and the 2 hours with matcher.group(1); matcher.group(2); matcher.group(3);
The matcher.group() function expects to take a single integer argument: the capturing group index, starting from 1. The index 0 is special and means "the entire match". A capturing group is created using a pair of parentheses "(...)". Anything within the parentheses is captured. Groups are numbered from left to right (again, starting from 1) by opening parenthesis (which means that groups can be nested). Since there are no parentheses in your regular expression, there can be no group 1.
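To make the numbering rule concrete, here is a small sketch with nested groups. The pattern ((\d{2}):(\d{2})) and the names GroupNumbering and groupsOf are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupNumbering {
    // Group 1 opens first (whole time), then group 2 (hours), then group 3 (minutes).
    private static final Pattern TIME = Pattern.compile("((\\d{2}):(\\d{2}))");

    // Returns groups 0..groupCount(), or an empty array if the input does not match.
    static String[] groupsOf(String input) {
        Matcher m = TIME.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Prints 06:30, 06:30, 06, 30 on separate lines.
        for (String g : groupsOf("06:30")) {
            System.out.println(g);
        }
    }
}
```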
The javadoc on the Pattern class covers the regular expression syntax.
If you are looking for a pattern that might recur some number of times, you can use Matcher.find() repeatedly until it returns false. Matcher.group(0) once on each iteration will then return what matched that time.
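A minimal sketch of that find() loop applied to the string from the question. The helper names (TimeExtractor, times, numberAfterMarker) are hypothetical, and the second helper assumes the number always directly follows the literal text "XX~ ", as the question states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimeExtractor {
    // Calls find() until it returns false, collecting each whole match.
    static List<String> times(String input) {
        Matcher m = Pattern.compile("\\d{2}:\\d{2}").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group()); // same as group(0): the entire match
        }
        return out;
    }

    // Captures the digits that follow the marker "XX~ ", if present.
    static Optional<String> numberAfterMarker(String input) {
        Matcher m = Pattern.compile("XX~ (\\d+)").matcher(input);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String myString = "A~BC~FGH~~zuzy|XX~ 1234~ ~~ABC~01/01/2010 06:30~BCD~01/01/2011 07:45";
        System.out.println(times(myString));             // [06:30, 07:45]
        System.out.println(numberAfterMarker(myString)); // Optional[1234]
    }
}
```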
If you want to build one big regular expression that matches everything all at once (which I believe is what you want), then put a set of capturing parentheses around each of the three things that you want to capture, use Matcher.matches() and then Matcher.group(n) where n is 1, 2 and 3 respectively. Of course Matcher.matches() might also return false, in which case the pattern did not match, and you can't retrieve any of the groups.
In your example, what you probably want to do is have it match some preceding text, then start a capturing group, match for digits, end the capturing group, etc. I don't know enough about your exact input format, but here is an example.
Let's say I had strings of the form:
Eat 12 carrots at 12:30
Take 3 pills at 01:15
And I wanted to extract the quantity and times. My regular expression would look something like:
"\w+ (\d+) [\w ]+ (\d{2}:\d{2})"
The code would look something like:
Pattern p = Pattern.compile("\\w+ (\\d+) [\\w ]+ (\\d{2}:\\d{2})");
Matcher m = p.matcher(oneline);
if (m.matches()) {
    System.out.println("The quantity is " + m.group(1));
    System.out.println("The time is " + m.group(2));
}
The regular expression means "a string containing a word, a space, one or more digits (which are captured in group 1), a space, a set of words and spaces ending with a space, followed by a time (captured in group 2; the time assumes that the hour is always 0-padded out to 2 digits)". I would give a closer example to what you are looking for, but the description of the possible input is a little vague.
