I need a regex that would ignore a token if a part of the token has already been captured before.
Example
var bold, det, bold=6, sum, k
Here, bold=6 should be ignored because bold has already been captured.
Also, var must be present before any matching can take place, the last token k should not be followed by a comma. Only the variables within var and the last token k should be followed by a comma.
Another Example
var bold=6, det, bold, sum, k
Here, bold which follows det should be ignored because bold=6 has already been captured.
i tried using this pattern (?:\\bvar\\b|\\G)\\s*(\\w+)(?:,|$), but it doesn't ignore what has been repeated.
Depends what information you need to get you can try with:
Solution working only in Java, will give you variable name and start
and end idices:
(?<=var.{0,999})(?<!=)(?!var\b)\b(?<var>\w+)\b(?<!var.{1,999}(?=\k<var>).{1,999}(?=\k<var>).{1,999})
RegexPlanet Demo
it uses ugly, but quite effective feature of Java regex: intervals
(x{min,max}) in lookbehind. As long as you use interval with
minimal and maximal length, you can use it in Java regex. So instead
of .* you can use for example .{0,999}. It will fail if there
need to be more char than 999, you can use bigger number, but I
think it is not necessary in this cese. Named group <var> is optional here, you can replece it in code with normal group.
Implementation in Java:
public class Test{
public static void main(String[] args){
String test = "var bold, det, bold=6, sum, k\n" +
"var foo=6, abc, foo, xyz, k";
Matcher matcher = Pattern.compile("(?<=var.{0,999})(?<!=)(?!var)\\b(?<var>\\w+)\\b(?<!var.{1,999}(?=\\k<var>).{1,999}(?=\\k<var>).{1,999})").matcher(test);
while(matcher.find()){
System.out.println(matcher.group("var") + "," + matcher.start("var") + "," + matcher.end("var"));
}
}
}
with output (variable name, start index, end index):
bold,4,8
det,10,13
sum,23,26
k,28,29
foo,34,37
abc,41,44
xyz,51,54
k,56,57
Explanation of regex:
(?<=var.{0,999}) - must be preceded by text var followed by any
number of characters, but not new line,
(?<!=) - should not be preceded by equal sign, to avoid matching variable name and value as different matches,
(?!var\b) - cannot be followed by var word, to avoid matching this word,
\b(?<var>\w+)\b - separate word, captured into <var> group,
(?<!var.{1,999}(?=\k<var>).{1,999}(?=\k<var>).{1,999}) - the matched word cannot by preceded by var word followed by some chars, including captured word, followed by some chars, inclusing captured word again,
But as I wrote, it will work only in Java.
If you need just variable names, you can use:
(?<=var\s|\G,\s)(?<var>\w+)(?=,|$)|(?<=var\s|\G,\s)(?<initialized>[^,\n]+)
DEMO
to get variable names without duplications. But if you want
start/end indices, it will capture into group second occurence of
duplicated variable name.
You can tweak your regex to this with a negative lookahead:
(?:\bvar\b|\G)\s*(?:(\w+)(?!.*\b\1\b)(?:=\w+)?|\S+)(?:,|\bk\b)
RegEx Demo
Rather than keeping track of what it has already matched it will skip matching a word if it is followed in rest of the string.
Here (?!.*\b\1\b) is a negative lookahead that will avoid matching a word if same word is found on RHS of input. \1 is back-reference of matched word.
RegEx Breakup:
(?:\bvar\b|\G) # match text var or \G
\s* # match 0 more spaces
(?: # start non-capturing group
(\w+)(?!.*\b\1\b) # match a word if same word is found in rest of the input
(?:=\w+)? # followed by optional = and some value
| # regex alternation
\S+ # OR match 1 or more non-space character
) # close non-capturing group
(?:,|\bk\b) # match a comma or k
Related
I want to determine a regular expression, but came up with only the below
Expression = ^[£]*|(,)|[0-9]p$
Test String = £4dj,gs,dl"34p
Issue: In my test string "4p" is coming as a match (which is near to my expection), but I want only "p" to be highlighted/matched, as I want to do a replace of p (and not the number),
Goal: replace p only if it comes after a number, without touching the number
Test String = ship
p in ship is not a match, which is valid for my scenario.
https://regex101.com/
Using ^[£]* matches optional repetitions of £ at the start of the string, and can also match an empty string.
If you only want to match the character to be replaced, you don't need the capture group around the single quote.
Your pattern, with the anchors:
^£|,|(?<=[0-9])p$
^£ Match a pound sign at the start of the string
| Or
, Match a comma
| Or
(?<=[0-9])p$ Match a p char at the end of the string asserting a digit to the left
Regex demo
Without the anchors to match all occurrences, you can combine the comma and pound sign in a character class.
[£,]|(?<=[0-9])p
Regex demo
I would like to find all possible occurrences of text enclosed between two ~s.
For example: For the text ~*_abc~xyz~ ~123~, I want the following expressions as matching patterns:
~*_abc~
~xyz~
~123~
Note it can be an alphabet or a digit.
I tried with the regex ~[\w]+?~ but it is not giving me ~xyz~. I want ~ to be reconsidered. But I don't want just ~~ as a possible match.
Use capturing inside a positive lookahead with the following regex:
Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex:
(?=(\w+))
At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string). It also returns what was captured by Group 1: ABCD
The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD.
The engine moves to the next position in the string, and the process repeats itself for CD then D.
So, use
(?=(~[^\s~]+~))
See the regex demo
The pattern (?=(~[^\s~]+~)) checks each position inside a string and searches for ~ followed with 1+ characters other than whitespace and ~ and then followed with another ~. Since the index is moved only after a position is checked, and not when the value is captured, overlapping substrings get extracted.
Java demo:
String text = " ~*_abc~xyz~ ~123~";
Pattern p = Pattern.compile("(?=(~[^\\s~]+~))");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while(m.find()) {
res.add(m.group(1));
}
System.out.println(res); // => [~*_abc~, ~xyz~, ~123~]
Just in case someone needs a Python demo:
import re
p = re.compile(r'(?=(~[^\s~]+~))')
test_str = " ~*_abc~xyz~ ~123~"
print(p.findall(test_str))
# => ['~*_abc~', '~xyz~', '~123~']
Try this [^~\s]*
This pattern doesn't consider the characters ~ and space (refered to as \s).
I have tested it, it works on your string, here's the demo.
I have this requirement - for an input string such as the one shown below
8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs
I would like to strip the matched word boundaries (where the matching pair is 8 or & or % etc) and will result in the following
This is really a test of repl%acing %mul%tiple matched 9pairs
This list of characters that is used for the pairs can vary e.g. 8,9,%,# etc and only the words matching the start and end with each type will be stripped of those characters, with the same character embedded in the word remaining where it is.
Using Java I can do a pattern as \\b8([^\\s]*)8\\b and replacement as $1, to capture and replace all occurrences of 8...8, but how do I do this for all the types of pairs?
I can provide a pattern such as \\b8([^\\s]*)8\\b|\\b9([^\\s]*)9\\b .. and so on that will match all types of matching pairs *8,9,..), but how do I specify a 'variable' replacement group -
e.g. if the match is 9...9, the the replacement should be $2.
I can of course run it through multiple of these, each replacing a specific type of pair, but I am wondering if there is a more elegant way.
Or is there a completely different way of approaching this problem?
Thanks.
You could use the below regex and then replace the matched characters by the characters present inside the group index 2.
(?<!\S)(\S)(\S+)\1(?=\s|$)
OR
(?<!\S)(\S)(\S*)\1(?=\s|$)
Java regex would be,
(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)
DEMO
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)(\\S)(\\S+)\\1(?=\\s|$)", "$2"));
Output:
This is reallly a test of repl%acing %mul%tiple matched 9pairs
Explanation:
(?<!\\S) Negative lookbehind, asserts that the match wouldn't be preceded by a non-space character.
(\\S) Captures the first non-space character and stores it into group index 1.
(\\S+) Captures one or more non-space characters.
\\1 Refers to the character inside first captured group.
(?=\\s|$) And the match must be followed by a space or end of the line anchor.
This makes sure that the first character and last character of the string must be the same. If so, then it replaces the whole match by the characters which are present inside the group index 2.
For this specific case, you could modify the above regex as,
String s1 = "8This8 is &reallly& a #test# of %repl%acing% %mul%tiple 9matched9 9pairs";
System.out.println(s1.replaceAll("(?<!\\S)([89&#%])(\\S+)\\1(?=\\s|$)", "$2"));
DEMO
(?<![a-zA-Z])[8&#%9](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[8&#%9](?![a-zA-Z])
Try this.Replace with $1 or \1.See demo.
https://regex101.com/r/qB0jV1/15
(?<![a-zA-Z])[^a-zA-Z](?=[a-zA-Z])([^\s]*?)(?<=[a-zA-Z])[^a-zA-Z](?![a-zA-Z])
Use this if you have many delimiters.
I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said number of commas is your primary matching criteria. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match at least 1 character that is anything but a comma, followed by a comma. And perform the aforementioned match 6 times consecutively. Note: your data must end with a trailing comma as is consistent with your sample data.
Well your regular expression is certainly jarbled - there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape ,s. Lets first describe our requirements, you seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says match one or more non-commas, followed by a comma - [^,]+, - six times - {6}. The (?:...) notation is a non-capturing group, which lets us say match the whole sub-expression six times, without it, the {6} would only apply to the preceding character.
Alternately, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
"([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if(m.matches()) {
for (int i = 1; i <= m.groupCount(); i++) {
System.out.println(m.group(i));
}
}
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F
How can i write this as a regular expression?
"blocka#123#456"
i have used # symbol to split the parameters in the data
and the parameters are block name,startX coordinate,start Y corrdinate
this is the data embedded in my QR code.so when i scan the QR i want to check if its the right QR they're scanning. For that i need a regular expression for the above syntax.
my method body
public void Store_QR(String qr){
if( qr.matches(regular Expression here)) {
CurrentLocation = qr;
}
else // Break the operation
}
The Information you specified does not justice using a regular expression at all.
Try to from it in a more general way.
If you really need to scan for "blocka#123#456" then use qr.contains("blocka#123#456");
It depends on what you want to match.
Here are some regex propositions:
^blocka#[0-9]{3}#[0-9]{3}$
^blocka#[0-9]+#[0-9]+$
^blocka(#[0-9]{3}){2}$
^blocka(#[0-9]+){2}$
^blocka(#[0-9]{3})+$
^blocka(#[0-9]+)+$
Otherwise, just use contains() or similar.
myregexp.com is nice to do some testing.
Official Java Regex Tutorial is quite ok to learn and includes most things one needs to know.
The Pattern documentation also includes fancy predefined character classes that are missing in above tutorial.
You did not specify anything that has to be regular in that example you gave. Regular expressions make only sense if there are rules to validate the input.
If it has to be exactly "blocka#123#456" then "blocka#123#456" or "^blocka#123#456$" will work as regex. Stuff between ^ and $ means that the regex inside must span from begin to end of the input. Sometimes required and usually a good idea to put that around your regex.
If blocka is dynamic replace it with [a-z]+ to match any sequence of lowercase letters a through z with length of at least 1. block[a-z] would match blocka, blockb, etc.
And [a-z]{6} would match any sequence of exactly 6 letters. [a-zA-Z] also includes uppercase letters and \p{L} matches any letter including unicode stuff (e.g. Blüc本).
# matches #. Like any character without special regex meaning ( \ ^ $ . | ? * + ( ) [ ] { } ) characters match themselves. [^#] matches every character but #.
Regarding the numbers: [0-9]+ or \d+ is a generic pattern for several numbers, [0-9]{1,4} would match anything consisting out of 1-4 numbers like 007, 5, 9999. (?:0|[1-9][0-9]{0,3}) for example will only match numbers between 0 and 9999 and does not allow leading zeros. (?:STUFF) is a non-capturing group that does not affect the groups you can extract via Matcher#group(1..?). Useful for logical grouping with |. The meaning of (?:0|[1-9][0-9]{0,3}) is: either a single 0 OR ( 1x 1-9 followed by 0 to 3 x 0-9).
[0-9] is so common that there is a predefinition for it : \d (digit). It's \\d inside the regex String since you have to escape the \.
So some of your options are
".*" which matches absolutely everything
"^[^#]+(?:#[^#]+)+$" which matches anything separated by # like "hello #world!1# -12.f #本#foo#bar"
"^blocka(#\\d+)+$" which matches blocka followed by at least one group of numbers separated by # e.g. blocka#1#12#0007#949432149#3
"^blocka#(?:[0-9]|[1-9][0-9]|[1-3][0-9]{2})#[4-9][0-9]{2}$" which will match only if it finds blocka# followed by numbers 0 - 399, followed by a # and finally numbers 400-999
"^blocka#123#456$" which matches only exactly that string.
All that are regular expressions that match the example you gave.
But it's probably as simple as
public void Store_QR(String qr){
if( qr.matches("^blocka#\\d+#\\d+$")) {
CurrentLocation = qr;
}
else // Break the operation
}
or
private static final Pattern QR_PATTERN = Pattern.compile("^blocka#(\\d+)#(\\d+)$");
public void Store_QR(String qr){
Matcher matcher = QR_PATTERN.matcher(qr);
if(matcher.matches()) {
int number1 = Integer.valueOf(matcher.group(1));
int number2 = Integer.valueOf(matcher.group(2));
CurrentLocation = qr;
}
else // Break the operation
}
BlockName#start_X#start_Y any block name.. starting with the string"block" and followed by two integers
I guess a good regex for that would be "^block\\w+#\\d+#\\d+$", starting with "block", then any combination of a-z, A-Z, 0-9 and _ (thats the \w) followed by #, numbers, #, numbers.
Would match block_#0#0, blockZ#9#9, block_a_Unicorn666#0000#1234, but not block#1#2 because there is no name at all and would not match blockName#123#abc because letters instead of number. Would also not match Block_a#123#456 because of the uppercase B.
If the name part (\\w+) is too liberal (___, _123 would be a legal names) use e.g. "^block_?[a-zA-Z]+#\\d+#\\d+$", what won't allow numbers and names may only be separated by a single optional _ and there have to be letters after that. Would allow _a, a, _ABc, but not _, _a_b, _a9. If you want to allow numbers in names [a-zA-Z0-9] would be the character class to use.
I suggest:
[a-z]+#\d+#\d+
And if you want capture the 3 parts:
([a-z]+)#(\d+)#(\d+)
Matcher.group( 1, 2 or 3 ) returns the parts