Enforce sequence and group same time with regex? - java

A bit of continuation of Get groups with regex and OR
Sample
AD ABCDEFG HIJKLMN
AB HIJKLMN
AC DJKEJKW SJKLAJL JSHELSJ
Rule: each line starts with a 2-character code (AB|AC|AD), followed by one or more 7-character codes. The separator between groups is a space or a '.'.
With this expression I get it nicely grouped
/^(AB|AC|AD)|((\S{7})+)/
I can access the 2chars code with group[0] and so on.
Can I enforce the rule above at the same time?
With above regex the following lines are also valid (because of the OR | in the regex statement)
AC
dfghjkl
asdfgh hjklpoi
Which is not what I need.
Thanks again to the regex experts

Try that:
^(A[BCD])(([ .])([A-Z]{7}))+$

Personally, I would do this in two separate steps
I'd check the string matches a regular expression
I'd split matching strings based on the separator chars [ .]
This code:
def input = [
    'AD ABCDEFG HIJKLMN',
    'AB HIJKLMN',
    'AC DJKEJKW SJKLAJL JSHELSJ',
    'AC',
    'dfghjkl',
    'asdfgh hjklpoi',
    'AC DJKEJKW.SJKLAJL JSHELSJ',
]
def regexp = /^A[BCD]([ .](\S{7}))+$/
def result = input.inject( [] ) { list, inp ->
    // Does the line match the regexp?
    if( inp ==~ regexp ) {
        // If so, split it
        list << inp.split( /[ .]/ )
    }
    list
}
println result
Shows you an example of what I mean, and prints out:
[[AD, ABCDEFG, HIJKLMN], [AB, HIJKLMN], [AC, DJKEJKW, SJKLAJL, JSHELSJ], [AC, DJKEJKW, SJKLAJL, JSHELSJ]]
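Since the question is tagged java, here is a rough Java sketch of the same validate-then-split idea. A repeated capturing group only keeps its last iteration, which is why the codes are recovered by splitting rather than by reading groups. The class and method names are made up for illustration, and the sketch assumes the 7-character codes are uppercase letters, as in the first answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class CodeLineParser {
    // Step 1: a 2-char prefix, then one or more 7-letter codes,
    // each preceded by a space or a dot. Assumes codes are A-Z only.
    private static final Pattern LINE = Pattern.compile("^A[BCD](?:[ .][A-Z]{7})+$");
    // Step 2: the separators to split on.
    private static final Pattern SEPARATOR = Pattern.compile("[ .]");

    // Returns the prefix followed by the codes, or an empty list if the line is invalid.
    static List<String> parse(String line) {
        if (!LINE.matcher(line).matches()) {
            return new ArrayList<>();
        }
        return Arrays.asList(SEPARATOR.split(line));
    }

    public static void main(String[] args) {
        System.out.println(parse("AC DJKEJKW.SJKLAJL JSHELSJ"));
        System.out.println(parse("AC"));
        System.out.println(parse("asdfgh hjklpoi"));
    }
}
```

The first element of a non-empty result is the 2-character code; the rest are the 7-character codes.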

Java regex with quantifier: replace matches by dynamically generated string

I want to replace all matches in a String with a dynamic number of characters; let's use [\d]+ as a simple example regex:
Desired results:
1984 -> DDDD
1 -> D
123456789 -> DDDDDDDDD
a 123 b -> a DDD b
The common Java approach for replacing regex matches looks like this:
Pattern p = Pattern.compile("[\\d]+");
String input = [...];
String replacement = ???;
String result = p.matcher(input).replaceAll(replacement);
My question is: do I have to extract the individual matches, measure their lengths, and then generate the replacement string based on that? Or does Java provide any more straightforward approach to that?
Update: I actually under-complicated the problem by giving rather simple examples. Here are some more that should be caught, taking more context into account:
<1984/> (Regex: <[\d]+/>) -> DDDDDDD
123. -> DDDD, whereas: 123 -> 123 (i.e. no match)
Note: I am not trying to parse HTML with a regex.
You're overcomplicating this. Just replace each \d with D. There's no need to measure the match or do any additional processing; a plain replaceAll() is enough:
final String[] a = {"1984", "1", "123456789", "a 123 b"};
for (String s : a) {
    System.out.println(s.replaceAll("\\d", "D"));
}
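The per-character replacement covers the simple examples, but not the updated ones where the match includes non-digit context such as <1984/> or 123. and every matched character should become a D. One possible approach, sketched here with made-up class and method names, is to build each replacement from the match length using Matcher.appendReplacement:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchMasker {
    // Replaces every match of the pattern with 'D' repeated once per matched character.
    static String mask(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char[] fill = new char[m.end() - m.start()];
            Arrays.fill(fill, 'D');
            // The replacement holds only 'D' characters, so no '$' or '\' needs quoting.
            m.appendReplacement(out, new String(fill));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask(Pattern.compile("<\\d+/>"), "<1984/>"));
        System.out.println(mask(Pattern.compile("\\d+\\."), "123. and 123"));
    }
}
```

On Java 9 and later, Matcher.replaceAll also accepts a function from the match result to the replacement string, which would shorten the loop.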

Regular expression find if same character repeats 3 or more no of times in Java

My requirement is to use only a Java regular expression to check whether a given string contains the same character repeated 3 or more times in a row.
For ex :
"hello" -> false
"ohhhhh" -> true
"whatsuppp" -> true
You can use the following regex for your problem:
^.*(.)\1\1.*$
Explanation
^ starting point of your string
.* any char 0 to N times
(.) one char in the capturing group that will be used by the backreference
\1 back reference to the captured character (we call it twice to force your 3 times repetition constraint)
.* any char 0 to N times
$ end of the input string
I have tested on :
hello -> false
ohhhhh -> true
whatsuppp -> true
aaa -> true
aaahhhahj -> true
abcdef -> false
abceeedef -> true
Last but not least, you have to add a backslash \ before each backslash \ in your regex before being able to use it in your Java code.
This gives you the following prototype Java code:
ArrayList<String> strVector = new ArrayList<String>();
strVector.add("hello");
strVector.add("ohhhhh");
strVector.add("whatsuppp");
strVector.add("aaa");
strVector.add("aaahhhahj");
strVector.add("abcdef");
strVector.add("abceeedef");
Pattern pattern = Pattern.compile("^.*(.)\\1\\1.*$");
Matcher matcher;
for (String elem : strVector) {
    System.out.println(elem);
    matcher = pattern.matcher(elem);
    if (matcher.find()) System.out.println("Found you!");
    else System.out.println("Not Found!");
}
giving at execution the following output:
hello
Not Found!
ohhhhh
Found you!
whatsuppp
Found you!
aaa
Found you!
aaahhhahj
Found you!
abcdef
Not Found!
abceeedef
Found you!
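As a side note, a shorter variant seems possible: because find() looks for the pattern anywhere in the input, the surrounding ^.* and .*$ can be dropped and the two backreferences written as \1{2}. This is only a sketch with a made-up helper name, not a replacement for the tested code above.

```java
import java.util.regex.Pattern;

class RepeatCheck {
    // One captured character followed by at least two copies of itself.
    private static final Pattern TRIPLE = Pattern.compile("(.)\\1{2}");

    static boolean hasTripleRepeat(String s) {
        // find() scans for the pattern anywhere, so no ^.* / .*$ padding is needed.
        return TRIPLE.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hello", "ohhhhh", "whatsuppp", "abceeedef"}) {
            System.out.println(s + " -> " + hasTripleRepeat(s));
        }
    }
}
```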

Regular Expression to extract text containing pipe characters

I have a string and need a regular expression to extract substrings from it.
Example: this is a|b|c|d whatever e|f|g|h
Result: a|b|c|d, e|f|g|h
However based on the Java code that I wrote, it is producing the results as follows:
Pattern ptyy = Pattern.compile("\\|*.+? ");
Matcher matcher_values = ptyy.matcher("this is a|b|c|d whatever e|f|g|h");
while (matcher_values.find()) {
    String line = matcher_values.group(0);
    System.out.println(line);
}
Result
this
is
a|b|c|d
whatever
The result is not what I have hoped for. Any advice?
I think this regex is enough: (.\|)+.
The (.\|)+ part matches each a|, b|, ... pair, and the final . matches the last character of the substring.
Your \|*.+? pattern matches 0 or more pipes, then 1 or more any chars other than a newline up to the first space. Thus, it matches almost all non-whitespace chunks in a string.
If a, b and c are just placeholders and there can be any non-whitespace chars, I'd suggest:
[^\s|]+(?:\|[^\s|]+)+
Details:
[^\s|]+ - 1 or more chars other than whitespace and |
(?:\|[^\s|]+)+ - 1 or more sequences of:
\| - a literal |
[^\s|]+ - 1 or more chars other than whitespace and |
Java demo:
Pattern ptyy = Pattern.compile("[^\\s|]+(?:\\|[^\\s|]+)+");
Matcher matcher_values = ptyy.matcher("this is a|b|c|d whatever e|f|g|h");
while (matcher_values.find()) {
    String line = matcher_values.group(0);
    System.out.println(line);
}
Based on your advice, I managed to come up with my own regular expression that handles different combinations of piped values.
Pattern ptyy = Pattern.compile("[^\\s|]+(?:\\|[^\\s|]+)+");
Matcher matcher_values = ptyy.matcher("this is a|b|c|d whater e|f|g|h and Az|09|23|A3 and 22|1212|12121|55555");
while (matcher_values.find()) {
    String line = matcher_values.group(0);
    System.out.println(line);
}
This will enable me to get the result
a|b|c|d
e|f|g|h
Az|09|23|A3
22|1212|12121|55555
Thanks everyone!
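For reuse, the loop can be wrapped in a small helper that returns the matches as a list. This is a sketch with made-up names; it assumes every token between pipes may be longer than one character, so the inner character class carries its own + quantifier. Without that +, the last token of each match is cut to a single character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeTokens {
    // A token, then one or more "|token" groups; the + inside the group
    // lets every token be longer than one character.
    private static final Pattern PIPED = Pattern.compile("[^\\s|]+(?:\\|[^\\s|]+)+");

    // Collects every pipe-separated chunk found in the text, in order.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PIPED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("this is a|b|c|d and Az|09|23|A3 and 22|1212|12121|55555"));
    }
}
```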

Match exactly N repetitions of the same character

How do I write an expression that matches exactly N repetitions of the same character (or, ideally, the same group)? Basically, what (.)\1{N-1} does, but with one important limitation: the expression should fail if the subject is repeated more than N times. For example, given N=4 and the string xxaaaayyybbbbbzzccccxx, the expressions should match aaaa and cccc and not bbbb.
I'm not focused on any specific dialect, feel free to use any language. Please do not post code that works for this specific example only, I'm looking for a general solution.
Use negative lookahead and negative lookbehind.
This would be the regex: (.)(?<!\1.)\1{N-1}(?!\1) except that Python's re module is broken.
English translation: "Match any character. Make sure that after you match that character, the character before it isn't also that character. Match N-1 more repetitions of that character. Make sure that the character after those repetitions is not also that character."
Unfortunately, the re module (and most regular expression engines) are broken, in that you can't use backreferences in a lookbehind assertion. Lookbehind assertions are required to be constant length, and the compilers aren't smart enough to infer that it is when a backreference is used (even though, like in this case, the backref is of constant length). We have to handhold the regex compiler through this, as so:
The actual answer will have to be messier: r"(.)(?<!(?=\1)..)\1{N-1}(?!\1)"
This works around that bug in the re module by using (?=\1).. instead of \1. (these are equivalent most of the time.) This lets the regex engine know exactly the width of the lookbehind assertion, so it works in PCRE and re and so on.
Of course, a real-world solution is something like [x.group() for x in re.finditer(r"(.)\1*", "xxaaaayyybbbbbzzccccxx") if len(x.group()) == 4]
I suspect you want to be using negative lookahead: (.)\1{N-1}(?!\1).
But that said...I suspect the simplest cross-language solution is just write it yourself without using regexes.
UPDATE:
^(.)\\1{3}(?!\\1)|(.)(?<!(?=\\2)..)\\2{3}(?!\\2) works for me more generally, including matches starting at the beginning of the string.
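For comparison, the "write it yourself without regexes" route mentioned above could look roughly like this in Java. It is a sketch with made-up names that works on char values, so characters outside the Basic Multilingual Plane are not treated as single symbols.

```java
import java.util.ArrayList;
import java.util.List;

class ExactRuns {
    // Returns every maximal run of one repeated character whose length is exactly n.
    static List<String> runsOfLength(String s, int n) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            // A run ends at the end of the string or where the character changes.
            if (i == s.length() || s.charAt(i) != s.charAt(start)) {
                if (i - start == n) {
                    runs.add(s.substring(start, i));
                }
                start = i;
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(runsOfLength("xxaaaayyybbbbbzzccccxx", 4));
    }
}
```

Because only maximal runs are considered, a run of five b characters never yields a four-character match.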
It is easy to put too much burden onto regular expressions and try to get them to do everything, when just nearly everything will do!
Use a regex to find all substrings consisting of a single character, and then check their length separately, like this:
use strict;
use warnings;
my $str = 'xxaaaayyybbbbbzzccccxx';
while ( $str =~ /((.)\2*)/g ) {
next unless length $1 == 4;
my $substr = $1;
print "$substr\n";
}
output
aaaa
cccc
Perl’s regex engine does not support variable-length lookbehind, so we have to be deliberate about it.
sub runs_of_length {
    my($n,$str) = @_;
    my $n_minus_1 = $n - 1;
    my $_run_pattern = qr/
        (?:
            # In the middle of the string, we have to force the
            # run being matched to start on a new character.
            # Otherwise, the regex engine will give a false positive
            # by starting in the middle of a run.
            (.) ((?!\1).) (\2{$n_minus_1}) (?!\2) |
            #$1 $2 $3
            # Don't forget about a potential run that starts at
            # the front of the target string.
            ^(.) (\4{$n_minus_1}) (?!\4)
            # $4 $5
        )
    /x;
    my @runs;
    while ($str =~ /$_run_pattern/g) {
        push @runs, defined $4 ? "$4$5" : "$2$3";
    }
    @runs;
}
A few test cases:
my @tests = (
    "xxaaaayyybbbbbzzccccxx",
    "aaaayyybbbbbzzccccxx",
    "xxaaaa",
    "aaaa",
    "",
);
$" = "][";
for (@tests) {
    my @runs = runs_of_length 4, $_;
    print qq<"$_":\n>,
          " - [@runs]\n";
}
Output:
"xxaaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"aaaayyybbbbbzzccccxx":
- [aaaa][cccc]
"xxaaaa":
- [aaaa]
"aaaa":
- [aaaa]
"":
- []
It’s a fun puzzle, but your regex-averse colleagues will likely be unhappy if such a construction shows up in production code.
How about this in python?
def match(string, n):
    # Split the string into runs of identical characters.
    parts = []
    current = None
    for c in string:
        if not current:
            current = c
        elif c == current[-1]:
            current += c
        else:
            parts.append(current)
            current = c
    # Keep the final run, which the loop itself never appends.
    if current:
        parts.append(current)
    # Return only the runs that are exactly n characters long.
    result = []
    for part in parts:
        if len(part) == n:
            result.append(part)
    return result
Testing with your string with various sizes:
match("xxaaaayyybbbbbzzccccxx", 6) = []
match("xxaaaayyybbbbbzzccccxx", 5) = ['bbbbb']
match("xxaaaayyybbbbbzzccccxx", 4) = ['aaaa', 'cccc']
match("xxaaaayyybbbbbzzccccxx", 3) = ['yyy']
match("xxaaaayyybbbbbzzccccxx", 2) = ['xx', 'zz', 'xx']
Explanation:
The first loop basically splits the text into parts, like so: ["xx", "aaaa", "yyy", "bbbbb", "zz", "cccc", "xx"]. Then the second loop tests those parts for their length. In the end the function only returns the parts that have the requested length. I'm not the best at explaining code, so anyone is free to enhance this explanation if needed.
Anyways, I think this'll do!
Why not leave to regexp engine what it does best - finding longest string of same symbols and then check length yourself?
In Perl:
my $str = 'xxaaaayyybbbbbzzccccxx';
while ($str =~ /(.)\1{3,}/g) {
    if (($+[0] - $-[0]) == 4) { # insert here full match length counting specific to language
        print (($1 x 4), "\n")
    }
}
>>> import itertools
>>> zz = 'xxaaaayyybbbbbzzccccxxaa'
>>> z = [''.join(grp) for key, grp in itertools.groupby(zz)]
>>> z
['xx', 'aaaa', 'yyy', 'bbbbb', 'zz', 'cccc', 'xx', 'aa']
From there you can iterate through the list and pick out the runs of length 4 very easily, like this:
>>> [item for item in z if len(item)==4]
['aaaa', 'cccc']
In Java, we can do it like this:
String test = "xxaaaayyybbbbbzzccccxx uuuuuutttttttt";
int trimLength = 4; // required length of each run
// (\w)\1+ matches a run of two or more identical word characters
Pattern p = Pattern.compile("(\\w)\\1+", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher m = p.matcher(test);
while (m.find()) {
    if (m.group().length() == trimLength) {
        System.out.println("Same Characters String " + m.group());
    }
}

How do I write a regular expression for these path expressions

I'm trying to write a helper method that breaks down path expressions and would love some help. Please consider a path pattern like the following four (round brackets indicate predicates):
item.sub_element.subsubelement(#key = string) ; or,
item..subsub_element(#key = string) ; or,
//subsub_element(#key = string) ; or,
item(#key = string)
what would a regular expression look like that matches those?
What I have come up with is this:
((/{2}?[\\w+_*])(\\([_=##\\w+\\*\\(\\)\\{\\}\\[\\]]*\\))?\\.{0,2})+
I'm reading this as: "match one or more occurrences of a string that consists of two groups: group one consists of one or more words with optional underscores and an optional double forward slash prefix; group two is optional and consists of at least one word with all other characters optional; groups are trailed by zero to two dots."
However, a test run on the fourth example with Matcher.matches() returns false. So, where's my error?
Any ideas?
TIA,
FK
Edit: from trying with http://www.regexplanet.com/simple/index.html it seems I wasn't aware of the difference between the Matcher.matches() and the Matcher.find() methods of the Matcher object. I was trying to break down the input string into substrings that match my regex. Consequently I need to use find(), not matches().
Edit2: This does the trick
([a-zA-Z0-9_]+)\.{0,2}(\(.*\))?
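The difference between the two methods can be seen with the pattern from Edit2. This is only an illustrative sketch; the class and method names are made up. matches() requires the pattern to cover the entire input, while find() returns successive matching substrings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesVsFind {
    // The pattern from Edit2, escaped for a Java string literal.
    static final Pattern STEP = Pattern.compile("([a-zA-Z0-9_]+)\\.{0,2}(\\(.*\\))?");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean wholeInputMatches(String path) {
        return STEP.matcher(path).matches();
    }

    // find() walks through the input and yields each matching substring in turn.
    static List<String> steps(String path) {
        List<String> out = new ArrayList<>();
        Matcher m = STEP.matcher(path);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String path = "item.sub_element.subsubelement(#key = string)";
        System.out.println(wholeInputMatches(path));
        System.out.println(steps(path));
    }
}
```

For the three-step path, matches() is false because a single step cannot cover the whole string, whereas find() yields one match per step.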
You misunderstand character classes, I think. I've found that for testing regular expressions, http://gskinner.com/RegExr/ is of great help. As a tutorial for regular expressions, I'd recommend http://www.regular-expressions.info/tutorial.html.
I am not entirely sure, how you want to group your strings. Your sentence seems to suggest, that your first group is just the item part of item..subsub_element(#key = string), but then I am not sure what the second group should be. Judging from what I deduce from your Regex, I'll just group the part before the brackets into group one, and the part in the brackets into group two. You can surely modify this if I misunderstood you.
I don't escape the expression for Java, so you'd have to do that. =)
The first group should begin with an optional double slash. I use
(?://)?. Here ?: means that this part should not be captured, and the last ? makes the group before it optional.
Following that, there are words, containing characters and underscores, grouped by dots. One such word (with trailing dots) can be represented as [a-zA-Z_]+\.{0,2}. The \w you use actually is a shortcut for [a-zA-Z0-9_], I think. It does NOT represent a word, but a "word character".
This last expression may be present multiple times, so the capturing expression for the first group looks like
((?://)?(?:[a-zA-Z_]+\.{0,2})+)
For the part in the brackets, one can use \([^)]*\), which means an opening bracket (escaped, since it has special meaning), followed by an arbitrary number of non-brackets (not escaped, since it has no special meaning inside a character class), and then a closing bracket.
Combined with ^ and $ to mark the beginning and end of line respectively, we arrive at
^((?://)?(?:[a-zA-Z_]+\.{0,2})+)(\([^)]*\))$
If I misunderstood your requirements, and need help with those, please ask in the comments.
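Escaped for Java, the combined expression might be used as in the sketch below. The class and method names are made up for illustration, and note that this pattern requires the bracketed part to be present.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSplitter {
    // Group 1: optional leading //, then words separated by up to two dots.
    // Group 2: the bracketed predicate, which this pattern makes mandatory.
    static final Pattern PATH = Pattern.compile(
            "^((?://)?(?:[a-zA-Z_]+\\.{0,2})+)(\\([^)]*\\))$");

    // Returns {pathPart, predicatePart}, or null when the input does not match.
    static String[] split(String expr) {
        Matcher m = PATH.matcher(expr);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] parts = split("item..subsub_element(#key = string)");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```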
You may find this website useful for testing your regexes: http://www.fileformat.info/tool/regex.htm.
As a general approach, try building the regex up from one that handles a simple case, write some tests and get it to pass. Then make the regex more complicated to handle the other cases as well. Make sure it passes both the original and the new tests.
There are so many things wrong with your pattern:
/{2}?: what do you think ? means here? Because if you think it makes /{2} optional, you're wrong. Instead ? is a reluctant modifier for the {2} repetition. Perhaps something like (?:/{2})? is what you intend.
[\w+_*] : what do you think the + and * mean here? Because if you think they represent repetition, you're wrong. This is a character class definition, and + and * literally mean the characters + and *. Perhaps you intend... actually I'm not sure what you intend.
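Both points can be checked quickly in Java. This is a small sketch; the helper name is made up and it simply wraps Pattern.matches, which tests the whole input against the regex.

```java
import java.util.regex.Pattern;

class QuantifierPitfalls {
    // True when the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Inside [...], + and * are literal characters, not repetition.
        System.out.println(full("[\\w+_*]", "+"));   // true
        System.out.println(full("[\\w+_*]", "ab"));  // false: a class matches one char
        // {2}? is a reluctant "exactly two"; it still requires two slashes.
        System.out.println(full("/{2}?", ""));       // false
        // Wrapping the repetition in a group makes the whole prefix optional.
        System.out.println(full("(?:/{2})?", ""));   // true
        System.out.println(full("(?:/{2})?", "//")); // true
    }
}
```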
Solution attempt
Here's an attempt at guessing what your spec is:
String PART_REGEX =
    "(word)(?:<<#(word) = (word)>>)?"
        .replace("word", "\\w+")
        .replace(" ", "\\s*")
        .replace("<<", "\\(")
        .replace(">>", "\\)");
Pattern entirePattern = Pattern.compile(
    "(?://)?part(?:\\.{1,2}part)*"
        .replace("part", PART_REGEX)
);
Pattern partPattern = Pattern.compile(PART_REGEX);
Then we can test it as follows:
String[] tests = {
"item.sub_element.subsubelement(#key = string)",
"item..subsub_element(#key = string)",
"//subsub_element(#key = string)",
"item(#key = string)",
"one.dot",
"two..dots",
"three...dots",
"part1(#k1=v1)..part2(#k2=v2)",
"whatisthis(#k=v1=v2)",
"noslash",
"/oneslash",
"//twoslashes",
"///threeslashes",
"//multiple//double//slashes",
"//multiple..double..dots",
"..startingwithdots",
};
for (String test : tests) {
    System.out.println("[ " + test + " ]");
    if (entirePattern.matcher(test).matches()) {
        Matcher part = partPattern.matcher(test);
        while (part.find()) {
            System.out.printf(" [%s](%s => %s)%n",
                part.group(1),
                part.group(2),
                part.group(3)
            );
        }
    }
}
The above prints:
[ item.sub_element.subsubelement(#key = string) ]
[item](null => null)
[sub_element](null => null)
[subsubelement](key => string)
[ item..subsub_element(#key = string) ]
[item](null => null)
[subsub_element](key => string)
[ //subsub_element(#key = string) ]
[subsub_element](key => string)
[ item(#key = string) ]
[item](key => string)
[ one.dot ]
[one](null => null)
[dot](null => null)
[ two..dots ]
[two](null => null)
[dots](null => null)
[ three...dots ]
[ part1(#k1=v1)..part2(#k2=v2) ]
[part1](k1 => v1)
[part2](k2 => v2)
[ whatisthis(#k=v1=v2) ]
[ noslash ]
[noslash](null => null)
[ /oneslash ]
[ //twoslashes ]
[twoslashes](null => null)
[ ///threeslashes ]
[ //multiple//double//slashes ]
[ //multiple..double..dots ]
[multiple](null => null)
[double](null => null)
[dots](null => null)
[ ..startingwithdots ]