Regular expression to select strings with a character appearing an odd number of times - java

I'm trying to create a regex that will select words where the quote character appears an odd number of times. And I'm stuck...
Let's say I have these 4 strings :
hello'
pl'pl'op
'heger
qwe'rty
I should get this list in return :
hello'
'heger
qwe'rty
I'm running around in circles and I don't even know if this is possible with a regex. I'm not so good at regex.
Should I just loop over each character of every string, count the quotes, and do a modulo operation to check whether the number is odd?

Code
^(?!(?:\w*'\w*'\w*)+$)[\w']+$
As per the comments below, an improvement can be made by changing the non-capturing group to an atomic group, as the following pattern demonstrates. This optimization is thanks to @Thefourthbird:
^(?!(?>\w*'\w*'\w*)+$)[\w']+$
Results
Input
hello'
pl'pl'op
'heger
qwe'rty
q'
q'q'
q'q'q'
q'q'q'q'
q'q'q'q'q'
q'q'q'q'q'q'
q'q'q'q'q'q'q'
q'q'q'q'q'q'q'q'
q'q'q'q'q'q'q'q'q'
Output
Only matches are shown below
hello'
'heger
qwe'rty
q'
q'q'q'
q'q'q'q'q'
q'q'q'q'q'q'q'
q'q'q'q'q'q'q'q'q'
Explanation
^ Assert position at the start of the line
(?!(?:\w*'\w*'\w*)+$) Negative lookahead ensuring what follows doesn't match
(?:\w*'\w*'\w*)+ Match any combination of apostrophes and word characters where the apostrophe character appears exactly twice, one or more times (this means 2,4,6,8,10,... times)
$ Assert position at the end of the line
[\w']+ Match one or more word characters or apostrophes '
$ Assert position at the end of the line
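As a rough illustration of how the atomic-group pattern could be used from Java, here is a minimal sketch. The class and method names (OddQuoteFilter, keepOddQuoted) are made up for this example, and the backslashes are doubled because the pattern sits in a Java string literal.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class OddQuoteFilter {
    // Pattern from the answer above, escaped for a Java string literal.
    private static final Pattern ODD_QUOTES =
            Pattern.compile("^(?!(?>\\w*'\\w*'\\w*)+$)[\\w']+$");

    // Keeps only the words the pattern accepts (hypothetical helper).
    static List<String> keepOddQuoted(List<String> words) {
        return words.stream()
                .filter(w -> ODD_QUOTES.matcher(w).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = List.of("hello'", "pl'pl'op", "'heger", "qwe'rty");
        System.out.println(keepOddQuoted(input)); // prints [hello', 'heger, qwe'rty]
    }
}
```

One thing worth keeping in mind: the lookahead only rejects even counts of two or more, so a word with no apostrophe at all (for example hello) also passes this pattern.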

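You don't need a regular expression. Just check whether the count returned by countMatches (presumably StringUtils from Apache Commons Lang) is odd:
import org.apache.commons.lang3.StringUtils;

public class Main {
    public static void main(String[] args) {
        String check = "pl'pl'op";
        int occurrences = StringUtils.countMatches(check, "'");
        System.out.println("Occurrences: " + occurrences);
        System.out.println("Odd: " + (occurrences % 2 == 1));
    }
}
Output:
Occurrences: 2
Odd: false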
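If pulling in a library just for this is not an option, the same count-and-modulo check can be sketched in plain Java. The class and method names (QuoteParity, hasOddCount) are made up for this example.

```java
class QuoteParity {
    // Returns true when 'target' occurs an odd number of times in 'text'.
    static boolean hasOddCount(String text, char target) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == target) {
                count++;
            }
        }
        return count % 2 == 1;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hello'", "pl'pl'op", "'heger", "qwe'rty"}) {
            System.out.println(word + " -> " + hasOddCount(word, '\''));
        }
    }
}
```

Unlike the regex answer, this version treats a word with zero quotes as not odd.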

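Try this:
([^']*'[^']*')*[^']*'[^']*
The idea is to capture in the group an even (possibly 0) number of quotes together with the text before each of them, and then match one more quote with any non-quote text around it. The pattern has to match the whole string (for example via String.matches) for the parity check to hold.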

Related

Extract substring from end till first alphabet in java

I have a string of format: A-2-Q4567
More examples: AB-456-T12, A24-5-M12345, etc.
I want to extract the last numerical values out of these strings, which are: 4567, 12, 12345 respectively (which is the numerical value of the substring from the end till first non-numeric character is encountered)
I can split the string, take the last element of the resulting array, and then do a parseInt after removing the non-numerical characters from it.
But is there a more elegant way of doing this?
You can use this regex: (\d+$). It returns the last sequence of digits in the string.
EDIT - some explanation:
The \d means any digit.
The + means one or more of the previous symbols. Since the previous symbol is a digit, then \d+ means "one or more digits".
The $ means the end of the string, so \d+$ is the last sequence of digits in the string.
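A minimal sketch of applying \d+$ from Java to the sample strings in the question. The class and method names (TrailingDigits, lastNumber) are invented for illustration, and throwing on a missing number is just one possible choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingDigits {
    // \d+$ : one or more digits immediately before the end of the input.
    private static final Pattern TRAILING = Pattern.compile("\\d+$");

    static int lastNumber(String input) {
        Matcher m = TRAILING.matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no trailing digits in: " + input);
        }
        return Integer.parseInt(m.group());
    }

    public static void main(String[] args) {
        System.out.println(lastNumber("A-2-Q4567"));    // prints 4567
        System.out.println(lastNumber("AB-456-T12"));   // prints 12
        System.out.println(lastNumber("A24-5-M12345")); // prints 12345
    }
}
```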
You can do this:
String getLastNumeric(String input)
{
    String str = "";
    char c;
    for (int i = input.length() - 1; i >= 0 && Character.isDigit(c = input.charAt(i)); i--)
        str = c + str;
    return str;
}
The regex solutions might be more elegant, but performance-wise I think the above is best, because a regex match can be more expensive than a simple for loop with a simple condition to evaluate.
Of course, the regex is more flexible: what if your requirements change and a dash "-" must now precede the numbers? With a regex it should just be a matter of changing one expression.
I put the regex version here, but remember that if you're sure your requirements won't change, I think the above solution is easier on the CPU:
Matcher matcher= Pattern.compile("(\\d+$)").matcher(input);
if(matcher.find())
return matcher.group();
return "";

Regex matcher to handle a character or end of line

I would like to create a matching pattern for a situation like this
DOMAIN+("Y|A")?
I would like the matching options to be only
DOMAIN
DOMAINY
DOMAINA
but seems like DOMAINX, DOMAINY etc. are matching as well.
Yes, they are matching because you did not specify that the string needs to end there. DOMAIN(Y|A)? matches DOMAINX because it rightfully contains DOMAIN followed by nothing (which is accepted, since ? allows 0 or 1 occurrence).
You can add this restriction by putting $ at the end of the regular expression.
Sample code that shows the results. Note that String.matches already requires the whole string to match, so the $ only makes a difference with methods such as Matcher.find. In your full code, you probably want to compile a Pattern once instead of recompiling it on each call.
public static void main(String[] args) {
    String regex = "DOMAIN(Y|A)?$";
    System.out.println("DOMAIN".matches(regex)); // prints true
    System.out.println("DOMAINX".matches(regex)); // prints false
    System.out.println("DOMAINY".matches(regex)); // prints true
    System.out.println("DOMAINA".matches(regex)); // prints true
}
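To see where the $ anchor actually changes the outcome, it may help to compare a partial search (Matcher.find) with a whole-string match (String.matches). The sketch below uses an invented helper, foundAnywhere, purely for illustration.

```java
import java.util.regex.Pattern;

class DomainAnchors {
    // True if the regex matches any substring of the input (partial search).
    static boolean foundAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Without $, find() is satisfied by the "DOMAIN" prefix of "DOMAINX".
        System.out.println(foundAnywhere("DOMAIN(Y|A)?", "DOMAINX"));  // prints true
        // With $, the match must reach the end of the input.
        System.out.println(foundAnywhere("DOMAIN(Y|A)?$", "DOMAINX")); // prints false
        // matches() always requires the entire string to match.
        System.out.println("DOMAINX".matches("DOMAIN(Y|A)?"));         // prints false
    }
}
```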
You could use word boundaries, \b, in order to prevent strings such as "DOMAINX" from being matched.
If you just want to handle cases where there are characters after the word, add \b to the end:
DOMAIN(?:Y|A)?\b
Otherwise, you could place \b around the expression to handle cases where there may be characters at the start/end:
\bDOMAIN(?:Y|A)?\b
I also made (?:Y|A) a non-capturing group and I removed the quotes.
However, as your title implies, if you only want to handle characters at the end of a line, use the $ anchor at the end of your expression:
DOMAIN(?:Y|A)?$
You may have to add the m (multi-line) flag so that the anchor matches at the start/end of a line rather than at the start/end of the string:
(?m)DOMAIN(?:Y|A)?$
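The effect of the (?m) flag can be sketched with a small multi-line input. The helper name findAll and the sample text are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineAnchor {
    // Collects every match of the regex in the text (hypothetical helper).
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "DOMAINY\nDOMAINX\nDOMAIN";
        // With (?m), $ matches before each line break as well as at the end.
        System.out.println(findAll("(?m)DOMAIN(?:Y|A)?$", text)); // prints [DOMAINY, DOMAIN]
        // Without it, $ only matches at the end of the whole input.
        System.out.println(findAll("DOMAIN(?:Y|A)?$", text));     // prints [DOMAIN]
    }
}
```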
You need this
DOMAIN(Y|A)?
If you need it to be a word in text you should anchor it with \b as Josh shows.
Your regex does the following:
DOMAIN+("Y|A")?
Options: Case sensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Regex syntax only
Match the character string “DOMAI” literally (case sensitive): DOMAI
Match the character “N” literally (case sensitive): N+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
Match the regex below and capture its match into backreference number 1: ("Y|A")?
    Between zero and one times, as many times as possible, giving back as needed (greedy): ?
    Match this alternative (attempting the next alternative only if this one fails): "Y
        Match the character string “"Y” literally (case sensitive): "Y
    Or match this alternative (the entire group fails if this one fails to match): A"
        Match the character string “A"” literally (case sensitive): A"
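The breakdown above can be checked directly in Java. This is only a sketch: the class name and the sample inputs are chosen for illustration, and String.matches is used so that each input has to match as a whole.

```java
class OriginalPatternDemo {
    // The pattern from the question, with the quotes escaped for a Java literal.
    static final String ORIGINAL = "DOMAIN+(\"Y|A\")?";

    public static void main(String[] args) {
        // N+ allows the final N to repeat.
        System.out.println("DOMAINNN".matches(ORIGINAL));   // prints true
        // The alternatives are the literal texts "Y and A" (quote included).
        System.out.println("DOMAIN\"Y".matches(ORIGINAL));  // prints true
        System.out.println("DOMAINA\"".matches(ORIGINAL));  // prints true
        // A bare Y or A is not covered by either alternative.
        System.out.println("DOMAINY".matches(ORIGINAL));    // prints false
        System.out.println("DOMAINA".matches(ORIGINAL));    // prints false
    }
}
```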

Regex to ignore variable initializations that have already been declared

I need a regex that would ignore a token if a part of the token has already been captured before.
Example
var bold, det, bold=6, sum, k
Here, bold=6 should be ignored because bold has already been captured.
Also, var must be present before any matching can take place, and the last token k should not be followed by a comma. Only the variables between var and the last token k should be followed by a comma.
Another Example
var bold=6, det, bold, sum, k
Here, bold which follows det should be ignored because bold=6 has already been captured.
I tried using this pattern (?:\\bvar\\b|\\G)\\s*(\\w+)(?:,|$), but it doesn't ignore what has been repeated.
Depending on what information you need, you can try the following.
This solution works only in Java and gives you the variable name plus its start and end indices:
(?<=var.{0,999})(?<!=)(?!var\b)\b(?<var>\w+)\b(?<!var.{1,999}(?=\k<var>).{1,999}(?=\k<var>).{1,999})
It uses an ugly but quite effective feature of Java regex: bounded intervals (x{min,max}) in lookbehind. As long as the interval has both a minimum and a maximum length, you can use it in a Java lookbehind, so instead of .* you can use, for example, .{0,999}. It will fail if more than 999 characters are needed; you can use a bigger number, but I don't think that is necessary in this case. The named group <var> is optional here; you can replace it in code with a normal group.
Implementation in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String test = "var bold, det, bold=6, sum, k\n" +
                "var foo=6, abc, foo, xyz, k";
        // Same pattern as above, escaped for a Java string literal
        Matcher matcher = Pattern.compile("(?<=var.{0,999})(?<!=)(?!var\\b)\\b(?<var>\\w+)\\b(?<!var.{1,999}(?=\\k<var>).{1,999}(?=\\k<var>).{1,999})").matcher(test);
        while (matcher.find()) {
            System.out.println(matcher.group("var") + "," + matcher.start("var") + "," + matcher.end("var"));
        }
    }
}
with output (variable name, start index, end index):
bold,4,8
det,10,13
sum,23,26
k,28,29
foo,34,37
abc,41,44
xyz,51,54
k,56,57
Explanation of regex:
(?<=var.{0,999}) - must be preceded by the text var followed by any number of characters, but not a new line,
(?<!=) - should not be preceded by an equals sign, to avoid matching a variable name and its value as separate matches,
(?!var\b) - cannot be followed by the word var, to avoid matching this word,
\b(?<var>\w+)\b - a separate word, captured into the <var> group,
(?<!var.{1,999}(?=\k<var>).{1,999}(?=\k<var>).{1,999}) - the matched word cannot be preceded by the word var followed by some characters that include the captured word, followed by some characters that include the captured word again.
But as I wrote, it will work only in Java.
If you need just variable names, you can use:
(?<=var\s|\G,\s)(?<var>\w+)(?=,|$)|(?<=var\s|\G,\s)(?<initialized>[^,\n]+)
to get variable names without duplicates. But if you want start/end indices, note that it will also capture the second occurrence of a duplicated variable name into a group.
You can tweak your regex to this with a negative lookahead:
(?:\bvar\b|\G)\s*(?:(\w+)(?!.*\b\1\b)(?:=\w+)?|\S+)(?:,|\bk\b)
Rather than keeping track of what it has already matched, it skips a word if the same word appears again later in the string.
Here (?!.*\b\1\b) is a negative lookahead that avoids matching a word if the same word is found to its right in the input. \1 is a back-reference to the matched word.
RegEx Breakup:
(?:\bvar\b|\G) # match text var or \G
\s* # match 0 or more whitespace characters
(?: # start non-capturing group
(\w+)(?!.*\b\1\b) # match a word only if the same word is not found in the rest of the input
(?:=\w+)? # followed by optional = and some value
| # regex alternation
\S+ # OR match 1 or more non-space character
) # close non-capturing group
(?:,|\bk\b) # match a comma or k
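A small Java sketch of how this pattern might be driven, collecting group 1 whenever it participated in the match. The class and method names (DeclaredNames, names) are made up for illustration. Because the lookahead rejects a name that appears again later, it is the last occurrence of a duplicate that gets captured, and in this sketch the final k is not captured at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DeclaredNames {
    // Pattern from the answer above, escaped for a Java string literal.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:\\bvar\\b|\\G)\\s*(?:(\\w+)(?!.*\\b\\1\\b)(?:=\\w+)?|\\S+)(?:,|\\bk\\b)");

    // Returns the names captured in group 1, skipping tokens where the
    // \S+ alternative was used instead (group 1 is null for those).
    static List<String> names(String declaration) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(declaration);
        while (m.find()) {
            if (m.group(1) != null) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names("var bold, det, bold=6, sum, k")); // prints [det, bold, sum]
        System.out.println(names("var bold=6, det, bold, sum, k")); // prints [det, bold, sum]
    }
}
```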

Programming error leads to inexplicable regex

For a test, I created the following regex by mistake:
|(\\w+)|
I was puzzled that this regex really works and I can't explain the result:
public static void main(String[] args) {
    String toReplace = "Hey I'm a lovely String an I'm giving my |value| worth!";
    // String replacement1 = "2 cent"; // I planned to replace |value| with 2 cent
    String replacement1 = "#"; // to produce a better output
    String regex = "|(\\w+)|"; // I forgot to escape the |
    String result = toReplace.replaceAll(regex, replacement1);
    System.out.println(result);
}
the result is:
#H#e#y# #I#'#m# #a# #l#o#v#e#l#y# #S#t#r#i#n#g# #a#n# #I#'#m# #g#i#v#i#n#g# #m#y# #|#v#a#l#u#e#|# #w#o#r#t#h#!#
My idea so far is that Java tries to replace the "nothing" between the characters, but why not the characters themselves?
\\w+ should match the 'H'.
I would expect every char to be replaced by three # signs, or only by one, but it puzzles me that the characters are not replaced at all.
You're right, this regex matches the empty string between each character.
Since the first alternative (the empty string left of |) matches, the rest of the pattern isn't even tried, so the \w+ isn't even reached by the matching engine. You could have written any (valid) pattern to the right of that first |, it wouldn't ever be reached.
The engine works the following way: It has a current position cursor in the subject string. It tries to match starting at that current position. Since your regex is a match, it will perform the replacement at this point, and then move the current position cursor after the found match.
But since the match is zero-width, it simply advances to the next character, because not doing so would result in an infinite loop.
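The behaviour described above can be reproduced on a shorter string, alongside the escaped pattern that was presumably intended. The class and method names below are invented for this sketch.

```java
class EmptyAlternative {
    // Unescaped pipes: the empty first alternative matches at every position.
    static String replaceUnescaped(String input) {
        return input.replaceAll("|(\\w+)|", "#");
    }

    // Escaped pipes: matches a literal |word| token.
    static String replaceEscaped(String input) {
        return input.replaceAll("\\|(\\w+)\\|", "#");
    }

    public static void main(String[] args) {
        String input = "a |b|";
        System.out.println(replaceUnescaped(input)); // prints #a# #|#b#|#
        System.out.println(replaceEscaped(input));   // prints a #
    }
}
```

The first call inserts a # at each of the six positions in the five-character string (including the very end) and leaves every character in place, which is the same effect seen in the question.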

validate string in java

I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said number of commas is your primary matching criteria. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match at least 1 character that is anything but a comma, followed by a comma. And perform the aforementioned match 6 times consecutively. Note: your data must end with a trailing comma as is consistent with your sample data.
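A minimal sketch of using the six-comma pattern as a validator in Java. The class and method names (CsvShape, isValid) and the negative sample inputs are assumptions made for illustration.

```java
import java.util.regex.Pattern;

class CsvShape {
    // Six groups of "one or more non-comma characters, then a comma".
    private static final Pattern SIX_FIELDS = Pattern.compile("^([^,]+,){6}$");

    static boolean isValid(String line) {
        return SIX_FIELDS.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F,")); // prints true
        System.out.println(isValid("a,b,c,"));       // prints false (only 3 fields)
        System.out.println(isValid("a,,c,d,e,f,"));  // prints false (empty field)
    }
}
```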
Well, your regular expression is certainly garbled: there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape the commas. Let's first describe our requirements. You seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says: match one or more non-commas followed by a comma ([^,]+,), six times ({6}). The (?:...) notation is a non-capturing group, which lets us say "match the whole sub-expression six times"; without it, the {6} would only apply to the preceding character.
Alternately, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
    "([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if (m.matches()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F
