How do I write a regular expression for these path expressions - java

I'm trying to write a helper method that breaks down path expressions and would love some help. Please consider a path pattern like the following four (round brackets indicate predicates):
item.sub_element.subsubelement(#key = string) ; or,
item..subsub_element(#key = string) ; or,
//subsub_element(#key = string) ; or,
item(#key = string)
what would a regular expression look like that matches those?
What I have come up with is this:
((/{2}?[\\w+_*])(\\([_=##\\w+\\*\\(\\)\\{\\}\\[\\]]*\\))?\\.{0,2})+
I'm reading this as: "match one or more occurrences of a string that consists of two groups: group one consists of one or more words with optional underscores and an optional double forward slash prefix; group two is optional and consists of at least one word with all other characters optional; groups are trailed by zero to two dots."
However, a test run on the fourth example with Matcher.matches() returns false. So, where's my error?
Any ideas?
TIA,
FK
Edit: from trying with http://www.regexplanet.com/simple/index.html it seems I wasn't aware of the difference between the matches() and find() methods of the Matcher object. I was trying to break down the input string into substrings that match my regex. Consequently I need to use find(), not matches().
Edit2: This does the trick
([a-zA-Z0-9_]+)\.{0,2}(\(.*\))?
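For anyone else who trips over the same thing, here is a minimal sketch of the matches() versus find() difference, using the Edit2 pattern escaped for a Java string literal. The class and method names (MatchesVsFind, matchesWhole, stepNames) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // The pattern from Edit2, escaped for a Java string literal.
    static final Pattern STEP =
            Pattern.compile("([a-zA-Z0-9_]+)\\.{0,2}(\\(.*\\))?");

    // matches() succeeds only if the pattern consumes the ENTIRE input.
    static boolean matchesWhole(String path) {
        return STEP.matcher(path).matches();
    }

    // find() scans for successive matches; collect group 1 of each one.
    static List<String> stepNames(String path) {
        List<String> names = new ArrayList<>();
        Matcher m = STEP.matcher(path);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String path = "item..subsub_element(#key = string)";
        System.out.println(matchesWhole(path)); // false
        System.out.println(stepNames(path));    // [item, subsub_element]
    }
}
```

The pattern describes a single path step, so matches() fails on any input with more than one step, while find() walks through the steps one by one.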

You misunderstand character classes, I think. I've found that for testing regular expressions, http://gskinner.com/RegExr/ is of great help. As a tutorial for regular expressions, I'd recommend http://www.regular-expressions.info/tutorial.html.
I am not entirely sure how you want to group your strings. Your sentence seems to suggest that your first group is just the item part of item..subsub_element(#key = string), but then I am not sure what the second group should be. Judging from what I can deduce from your regex, I'll just group the part before the brackets into group one, and the part in the brackets into group two. You can surely modify this if I misunderstood you.
I don't escape the expression for Java, so you'd have to do that. =)
The first group should begin with an optional double slash. I use
(?://)?. Here ?: means that this part should not be captured, and the last ? makes the group before it optional.
Following that, there are words, containing characters and underscores, grouped by dots. One such word (with trailing dots) can be represented as [a-zA-Z_]+\.{0,2}. The \w you use actually is a shortcut for [a-zA-Z0-9_], I think. It does NOT represent a word, but a "word character".
This last expression may be present multiple times, so the capturing expression for the first group looks like
((?://)?(?:[a-zA-Z_]+\.{0,2})+)
For the part in the brackets, one can use \([^)]*\), which means an opening bracket (escaped, since it has special meaning), followed by an arbitrary number of non-brackets (not escaped, since the bracket has no special meaning inside a character class), and then a closing bracket.
Combined with ^ and $ to mark the beginning and end of line respectively, we arrive at
^((?://)?(?:[a-zA-Z_]+\.{0,2})+)(\([^)]*\))$
If I misunderstood your requirements, and need help with those, please ask in the comments.
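Since the expression above is not escaped for Java, here is a rough sketch of what the escaped version could look like in use. The class name PathSplitDemo and the helper split are hypothetical. The sketch keeps the pattern as written, so a predicate is mandatory and digits are not accepted in element names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSplitDemo {
    // Same pattern as above; every backslash is doubled for the Java literal.
    static final Pattern PATH = Pattern.compile(
            "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)(\\([^)]*\\))$");

    // Returns {pathPart, predicatePart}, or null if the input does not match.
    static String[] split(String input) {
        Matcher m = PATH.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("item..subsub_element(#key = string)")));
        System.out.println(Arrays.toString(split("//subsub_element(#key = string)")));
        System.out.println(Arrays.toString(split("one.dot"))); // null: no predicate
    }
}
```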

You may find this website useful for testing your regexes: http://www.fileformat.info/tool/regex.htm.
As a general approach, try building the regex up from one that handles a simple case, write some tests and get it to pass. Then make the regex more complicated to handle the other cases as well. Make sure it passes both the original and the new tests.

There are so many things wrong with your pattern:
/{2}?: what do you think ? means here? Because if you think it makes /{2} optional, you're wrong. Instead ? is a reluctant modifier for the {2} repetition. Perhaps something like (?:/{2})? is what you intend.
[\w+_*] : what do you think the + and * mean here? Because if you think they represent repetition, you're wrong. This is a character class definition, and + and * literally mean the characters + and *. Perhaps you intend... actually I'm not sure what you intend.
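Both points are easy to check for yourself. The following is just a small sketch; the class QuantifierPitfalls and the helper full are names invented for this demo.

```java
import java.util.regex.Pattern;

class QuantifierPitfalls {
    // True only if the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // /{2}? still demands exactly two slashes; the ? only makes {2} reluctant.
        System.out.println(full("/{2}?item", "item"));       // false
        System.out.println(full("/{2}?item", "//item"));     // true
        // Wrapping the repetition in a group makes the slashes optional.
        System.out.println(full("(?:/{2})?item", "item"));   // true
        System.out.println(full("(?:/{2})?item", "//item")); // true
        // Inside a character class, + and * are literal characters,
        // and the class as a whole matches exactly one character.
        System.out.println(full("[\\w+_*]", "+"));           // true
        System.out.println(full("[\\w+_*]", "ab"));          // false
    }
}
```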
Solution attempt
Here's an attempt at guessing what your spec is:
String PART_REGEX =
"(word)(?:<<#(word) = (word)>>)?"
.replace("word", "\\w+")
.replace(" ", "\\s*")
.replace("<<", "\\(")
.replace(">>", "\\)");
Pattern entirePattern = Pattern.compile(
"(?://)?part(?:\\.{1,2}part)*"
.replace("part", PART_REGEX)
);
Pattern partPattern = Pattern.compile(PART_REGEX);
Then we can test it as follows:
String[] tests = {
"item.sub_element.subsubelement(#key = string)",
"item..subsub_element(#key = string)",
"//subsub_element(#key = string)",
"item(#key = string)",
"one.dot",
"two..dots",
"three...dots",
"part1(#k1=v1)..part2(#k2=v2)",
"whatisthis(#k=v1=v2)",
"noslash",
"/oneslash",
"//twoslashes",
"///threeslashes",
"//multiple//double//slashes",
"//multiple..double..dots",
"..startingwithdots",
};
for (String test : tests) {
System.out.println("[ " + test + " ]");
if (entirePattern.matcher(test).matches()) {
Matcher part = partPattern.matcher(test);
while (part.find()) {
System.out.printf(" [%s](%s => %s)%n",
part.group(1),
part.group(2),
part.group(3)
);
}
}
}
The above prints:
[ item.sub_element.subsubelement(#key = string) ]
[item](null => null)
[sub_element](null => null)
[subsubelement](key => string)
[ item..subsub_element(#key = string) ]
[item](null => null)
[subsub_element](key => string)
[ //subsub_element(#key = string) ]
[subsub_element](key => string)
[ item(#key = string) ]
[item](key => string)
[ one.dot ]
[one](null => null)
[dot](null => null)
[ two..dots ]
[two](null => null)
[dots](null => null)
[ three...dots ]
[ part1(#k1=v1)..part2(#k2=v2) ]
[part1](k1 => v1)
[part2](k2 => v2)
[ whatisthis(#k=v1=v2) ]
[ noslash ]
[noslash](null => null)
[ /oneslash ]
[ //twoslashes ]
[twoslashes](null => null)
[ ///threeslashes ]
[ //multiple//double//slashes ]
[ //multiple..double..dots ]
[multiple](null => null)
[double](null => null)
[dots](null => null)
[ ..startingwithdots ]
Attachments
Source code and output on ideone.com

Related

Replace substring only in certain portions of text delimited by some character

I need to replace all occurrences of a substring, but only if it is preceded by "]" and followed by "[" (preceded and followed, but not necessarily directly next to the substring). Example:
This would be the string where I need to do the substitutions:
[style and tags info] valid text info [more style info] more info here[styles]
If the expression to replace was: info -> change (it may be more than a single word)
The result should be:
[style and tags info] valid text change [more style info] more change here[styles]
My idea was to use a regex to isolate the words I have to change and then make the replacement with a call to replaceAll.
But I have tried several regexs to isolate the search expression without success. Mainly because I would need something like
(?<=.*)
that is, a lookbehind with an arbitrary number of characters before the word I am looking for. And this is not supported by Java regex (nor any other implementation of regex that I know).
I have found this solution, written in matlab, but it seems harder to replicate in Java:
Matlab regex - replace substring ONLY within angled brackets
Is there a simpler approach? Some regex I have not considered?
I'd say the easiest way here is to split the string into (parts outside the brackets) and (parts inside the brackets), and then only apply the replacements to (parts outside the brackets).
For example, you can do this using split (this assumes that your []s are evenly balanced, you're not opening two [[, etc):
String[] parts = str.split("[\\[\\]]");
StringBuilder sb = new StringBuilder(str.length());
for (int i = 0; i < parts.length; i++) {
if (i % 2 == 0) {
// This bit was outside [], so apply the replacement.
sb.append(parts[i].replace("info", "change"));
} else {
// This bit was inside [], so keep it unchanged
// (and re-append the delimiters).
sb.append("[");
sb.append(parts[i]);
sb.append("]");
}
}
String newStr = sb.toString();
It seems more appropriate to match and skip the substrings that start with [, then have 1 or more chars other than [ and ] up to the closing ], and replace info with change in all other contexts. For this purpose, you may use Matcher#appendReplacement() method:
String s = "[style and tags info] valid text info [more style info] more info here[styles]";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile("\\[[^\\]\\[]+]|\\b(info)\\b").matcher(s);
while (m.find()) {
if (m.group(1) != null) {
m.appendReplacement(result, "change");
}
else {
m.appendReplacement(result, m.group());
}
}
m.appendTail(result);
System.out.println(result.toString());
// => [style and tags info] valid text change [more style info] more change here[styles]
See the Java demo
The \[[^\]\[]+]|\b(info)\b regex matches those [...] substrings with \[[^\]\[]+] alternative branch and \b(info)\b branch (Group 1) captures the whole word info. If Group 1 matches, the replacement occurs, else, the matched [...] substring is inserted back into the result.
As for your original logic, yes, you can use a "simple" .replaceAll with the (?:\G|(?<=]))([^\]\[]*?)\binfo\b regex (with $1change replacement), but I doubt it is what you need.
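Since the search expression "may be more than a single word", the match-and-skip idea can be wrapped in a small helper. This is only a sketch: the class BracketAwareReplace and the method replaceOutsideBrackets are hypothetical names. It assumes the search text begins and ends with word characters (otherwise the \b anchors will not behave as intended). Pattern.quote and Matcher.quoteReplacement keep special characters in either argument from being interpreted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketAwareReplace {
    // Replaces whole-word occurrences of target that lie outside [...] segments.
    static String replaceOutsideBrackets(String text, String target, String replacement) {
        Pattern p = Pattern.compile(
                "\\[[^\\]\\[]+]|\\b(" + Pattern.quote(target) + ")\\b");
        Matcher m = p.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Group 1 is set only when the target was found outside brackets;
            // otherwise the match is a [...] segment that is copied back as-is.
            String piece = (m.group(1) != null) ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(piece));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "[style and tags info] valid text info [more style info] more info here[styles]";
        System.out.println(replaceOutsideBrackets(s, "info", "change"));
    }
}
```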

How to get match within [ and ] in String using regex

I have a requirement to get the matching string within square brackets [].
For example, in a String input like "[***]qwerty",
I should get the match as "***" string.
The regex I used in vain is "\\[(.+)\\]"
My Java code is as below:
Pattern pattern = Pattern.compile(regex_custom_delimiter_pattern); //see regex above
Matcher matcher = pattern.matcher("[***]qwerty");
String delimiter = null;
if (matcher.find()) {
delimiter = matcher.group(0);
}
Any help is appreciated..wondering what I'm missing in the regex that I used :(
That should work correctly, but you can use a more efficient expression if the value between [ and ] doesn't contain [ or ] literally:
\\[([^\\]]+)]
Or if the value can contain [ or ] then:
\\[(.+?)\\]
Also your main problem is that you are getting group 0 matcher.group(0) which is the entire match, your value is stored in group 1 so you need matcher.group(1).
You need group 1 instead of group 0. Group 0 is the whole match.
delimiter = matcher.group(1);
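To make the difference between the two groups concrete, here is a small sketch using the negated-class pattern suggested above. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketGroupDemo {
    static final Pattern BRACKETED = Pattern.compile("\\[([^\\]]+)]");

    // Group 0 is the entire match, brackets included.
    static String wholeMatch(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.find() ? m.group(0) : null;
    }

    // Group 1 is only what the first pair of parentheses captured.
    static String insideBrackets(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("[***]qwerty"));     // [***]
        System.out.println(insideBrackets("[***]qwerty")); // ***
    }
}
```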

Regular expression to get characters before brackets or comma

I'm pulling my hair out a bit with this.
Say I have a string 7f8hd::;;8843fdj fls "": ] fjisla;vofje]]} fd)fds,f,f
I want to extract 7f8hd::;;8843fdj fls "": from the string. The part I want ends at the first }, ], , or ). All of those characters may be present, but I only need the text before the first one.
I have tried without success to create a regular expression with a Matcher and Pattern class but I just can't seem to get it right.
The best I could come up with is below but my reg exp just doesn't seem to work like I think it should.
String line = "7f8hd::;;8843fdj fls \"\": ] fjisla;vofje]]} fd)fds,f,f";
Matcher matcher = Pattern.compile("(.*?)\\}|(.*?)\\]|(.*?)\\)|(.*?),").matcher(line);
while (matcher.find()) {
System.out.println(matcher.group());
}
I'm clearly not understanding reg exp correctly. Any help would be great.
^[^\]}),]*
matches from the start of the string until (but excluding) the first ], }, ) or ,.
In Java:
Pattern regex = Pattern.compile("^[^\\]}),]*");
Matcher regexMatcher = regex.matcher(line);
if (regexMatcher.find()) {
System.out.println(regexMatcher.group());
}
(You can actually remove the backslashes ([^]}),]), but I like to keep them there for clarity and for compatibility since not all regex engines recognize that idiom.)
Explanation:
^ # Match the start of the string
[^\]}),]* # Match zero or more characters except ], }, ) or ,
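Here is a self-contained sketch of the same idea that you can run against the sample line. The class PrefixBeforeDelimiter and the method prefix are illustrative names only. Note that the embedded double quotes in the sample have to be escaped in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixBeforeDelimiter {
    // Everything from the start up to (but excluding) the first ], }, ) or comma.
    static final Pattern PREFIX = Pattern.compile("^[^\\]}),]*");

    static String prefix(String line) {
        Matcher m = PREFIX.matcher(line);
        // The pattern can match the empty string, so find() succeeds on any input.
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String line = "7f8hd::;;8843fdj fls \"\": ] fjisla;vofje]]} fd)fds,f,f";
        // The brackets only make the trailing space visible in the output.
        System.out.println("[" + prefix(line) + "]");
    }
}
```

The result keeps the space in front of the first ]; call trim() on it if that is not wanted.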
You could just cut off the rest with replaceAll:
String newStr = yourStr.replaceAll("[\\])},].*", "");
Or use split() and take the first element:
String newStr = yourStr.split("[\\])},]")[0];
You can use this (as java string):
"(.+?)[\\]},)].*"
here is a fiddle
Could you try the regular expression (.*?)[}\]),](.*?) I tested it on rubular and it worked against your example.

Match exactly N repetitions of the same character

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.
Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken (see this link).
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (like most regular expression engines) is broken in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, as in this case, the backref is of constant length). We have to hand-hold the regex compiler through this, like so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]
I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).
But that said... I suspect the simplest cross-language solution is to just write it yourself without using regexes.
UPDATE:
^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.
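The doubled backslashes suggest a string literal, so here is a rough Java sketch that builds the same expression for an arbitrary N instead of hard-coding 3. The class ExactRuns and the method runsOfLength are hypothetical names. It relies on java.util.regex accepting a lookahead nested inside a fixed-width lookbehind, which I believe it does, but treat that as an assumption and test it on your JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExactRuns {
    // Returns every run of exactly n identical characters, in order of appearance.
    static List<String> runsOfLength(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        String k = String.valueOf(n - 1);
        // First alternative: a run anchored at the start of the input.
        // Second alternative: a run whose preceding character is different.
        String regex = "^(.)\\1{" + k + "}(?!\\1)"
                + "|(.)(?<!(?=\\2)..)\\2{" + k + "}(?!\\2)";
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> runs = new ArrayList<>();
        while (m.find()) {
            runs.add(m.group()); // whole match, whichever alternative fired
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(runsOfLength("xxaaaayyybbbbbzzccccxx", 4));
    }
}
```

Because . does not match line terminators by default, runs of newline characters are not reported by this sketch.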
It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!
Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:
use strict;
use warnings;
my $str = 'xxaaaayyybbbbbzzccccxx';
while ( $str =~ /((.)\2*)/g ) {
next unless length $1 == 4;
my $substr = $1;
print "$substr\n";
}
output
aaaa
cccc
Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.
sub runs_of_length {
my($n,$str) = @_;
my $n_minus_1 = $n - 1;
my $_run_pattern = qr/
(?:
# In the middle of the string, we have to force the
# run being matched to start on a new character.
# Otherwise, the regex engine will give a false positive
# by starting in the middle of a run.
(.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
#$1 $2 $3
# Don't forget about a potential run that starts at
# the front of the target string.
^(.) (\4{$n_minus_1}) (?!\4)
# $4 $5
)
/x;
my @runs;
while ($str =~ /$_run_pattern/g) {
push @runs, defined $4 ? "$4$5" : "$2$3";
}
@runs;
}
A few test cases:
my @tests = (
"xxaaaayyybbbbbzzccccxx",
"aaaayyybbbbbzzccccxx",
"xxaaaa",
"aaaa",
"",
);
$" = "][";
for (@tests) {
my @runs = runs_of_length 4, $_;
print qq<"$_":\n>,
" - [@runs]\n";
}
Output:
"xxaaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"xxaaaa":
- [aaaa]
"aaaa":
- [aaaa]
"":
- []
It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.
How about this in python?
def match(string, n):
    parts = []
    current = None
    for c in string:
        if not current:
            current = c
        else:
            if c == current[-1]:
                current += c
            else:
                parts.append(current)
                current = c
    # Append the final run, which the loop above never flushes.
    if current:
        parts.append(current)
    result = []
    for part in parts:
        if len(part) == n:
            result.append(part)
    return result
Testing with your string with various sizes:
match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ["bbbbb"]
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ["yyy"]
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz', 'xx']
Explanation:
The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the requested length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.
Anyways, I think this'll do!
Why not leave to the regexp engine what it does best, finding the longest string of identical symbols, and then check the length yourself?
In Perl:
my $str = 'xxaaaayyybbbbbzzccccxx';
while($str =~ /(.)\1{3,}/g){
if(($+[0] - $-[0]) == 4){ # insert here full match length counting specific to language
print (($1 x 4), "\n")
}
}
>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]
>>> z
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']
From there you can iterate through the list and check for occasions when N==4 very easily, like this:
>>> [item for item in z if len(item)==4]
['aaaa', 'cccc']
In Java, we can do it like this:
String test ="xxaaaayyybbbbbzzccccxx uuuuuutttttttt";
int trimLength = 4; // length of the same characters
Pattern p = Pattern.compile("(\\w)\\1+",Pattern.CASE_INSENSITIVE| Pattern.MULTILINE);
Matcher m = p.matcher(test);
while (m.find())
{
if(m.group().length()==trimLength) {
System.out.println("Same Characters String " + m.group());
}
}

Enforce sequence and group same time with regex?

A bit of continuation of Get groups with regex and OR
Sample
AD ABCDEFG HIJKLMN
AB HIJKLMN
AC DJKEJKW SJKLAJL JSHELSJ
Rule: always a 2-character code (AB|AC|AD) at the beginning of the line, followed by one or more 7-character codes. The separator between the groups can be a space or a '.'
With this expression I get it nicely grouped
/^(AB|AC|AD)|((\S{7})+)/
I can access the 2chars code with group[0] and so on.
Can I enforce the rule as above the same time ?
With above regex the following lines are also valid (because of the OR | in the regex statement)
AC
dfghjkl
asdfgh hjklpoi
Which is not what I need.
Thanks again to the regex experts
Try that:
^(A[BCD])(([ .])([A-Z]{7}))+$
Personally, I would do this in two separate steps:
1. I'd check the string matches a regular expression
2. I'd split matching strings based on the separator chars [ .]
This code:
def input = [
'AD ABCDEFG HIJKLMN',
'AB HIJKLMN',
'AC DJKEJKW SJKLAJL JSHELSJ',
'AC',
'dfghjkl',
'asdfgh hjklpoi',
'AC DJKEJKW.SJKLAJL JSHELSJ',
]
def regexp = /^A[BCD]([ .](\S{7}))+$/
def result = input.inject( [] ) { list, inp ->
// Does the line match the regexp?
if( inp ==~ regexp ) {
// If so, split it
list << inp.split( /[ .]/ )
}
list
}
println result
Shows you an example of what I mean, and prints out:
[[AD, ABCDEFG, HIJKLMN], [AB, HIJKLMN], [AC, DJKEJKW, SJKLAJL, JSHELSJ], [AC, DJKEJKW, SJKLAJL, JSHELSJ]]
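The same two-step idea translates fairly directly to Java. One reason to split rather than read capturing groups is that a capturing group inside a repetition only keeps its last iteration, so the single-regex version cannot hand back every 7-character code. The sketch below uses hypothetical names (CodeLineParser, parse) and assumes the codes are uppercase letters, as in the [A-Z]{7} suggestion further up; swap in \\S{7} if other characters are allowed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class CodeLineParser {
    // Step 1: validate the whole line (2-char code, then one or more 7-char codes).
    static final Pattern LINE = Pattern.compile("^A[BCD](?:[ .][A-Z]{7})+$");
    // Step 2: split valid lines on the separator characters.
    static final Pattern SEPARATOR = Pattern.compile("[ .]");

    // Returns all codes of a valid line, or an empty list for an invalid line.
    static List<String> parse(String line) {
        if (!LINE.matcher(line).matches()) {
            return Collections.emptyList();
        }
        return Arrays.asList(SEPARATOR.split(line));
    }

    public static void main(String[] args) {
        System.out.println(parse("AC DJKEJKW.SJKLAJL JSHELSJ"));
        System.out.println(parse("AC"));
        System.out.println(parse("asdfgh hjklpoi"));
    }
}
```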
