Java: Weirdness in replaceAll RegEx

I'm trying to manipulate a String in Java to recognize the markdown options in Facebook Messenger.
I tested the RegEx in a couple of online testers and it worked, but when I tried to implement in Java, it's only recognizing text surrounded by underscores. I have an example that shows the problem here:
private String process(String input) {
String processed = input.replaceAll("(\\b|^)\\_(.*)\\_(\\b|$)", "underscore")
.replaceAll("(\\b|^)\\*(.*)\\*(\\b|$)", "star")
.replaceAll("(\\b|^)```(.*)```(\b|$)", "backticks")
.replaceAll("(\\b|^)\\~(.*)\\~(\\b|$)", "tilde")
.replaceAll("(\\b|^)\\`(.*)\\`(\\b|$)", "tick")
.replaceAll("(\\b|^)\\\\\\((.*)\\\\\\)(\\b|$)", "backslashparen")
.replaceAll("\\*", "%"); // am I matching stars wrong?
return processed;
}
public void test() {
String example = "_Text_\n" +
"*text*\n" +
"~Text~\n" +
"`Text`\n" +
"_Text_\n" + // is it only matching the first one?
"``` Text ```\n" +
"\\(Text\\)\n" +
"~Text~\n";
System.out.println(process(example));
}
I expected all the lines to match and be replaced, but only the first line was matched. I wondered if that was because it was the first line, so I copied it into the middle, and both copies matched. Then I figured I might have missed something when matching the special characters, so I added the snippet that matches the asterisks and replaces them with a percent sign, and that worked. The output I'm getting looks like this:
underscore
%text%
~Text~
`Text`
underscore
``` Text ```
\(Text\)
~Text~
Any ideas what I might be missing?
Thanks.

If you're using word boundaries, there is no need to add anchors in an alternation, because \b also matches at the start and end of the string whenever the adjacent character is a word character. So, next to a word character such as the underscore, these alternations are redundant:
(?:^|\b)
(?:\b|$)
and both can be replaced by just \b.
However, note that of the delimiters in your regex only the underscore is a word character; *, ~ and ` are not. \b therefore cannot match between a line break and one of those characters, and \B, the inverse of \b, should be used around them instead.
Besides this, the regex can be improved further by using a negated character class instead of the greedy .* and by removing the unnecessary groups.
Code:
class MyRegex {
public static void main (String[] args) {
String example = "_Text_\n" +
"*text*\n" +
"~Text~\n" +
"`Text`\n" +
"_Text_\n" + // is it only matching the first one?
"``` Text ```\n" +
"\\(Text\\)\n" +
"~Text~\n";
System.out.println(process(example));
}
private static String process(String input) {
String processed = input.replaceAll("\\b_[^_]+_\\b", "underscore")
.replaceAll("\\B\\*[^*]+\\*\\B", "star")
.replaceAll("\\B```.+?```\\B", "backticks")
.replaceAll("\\B~[^~]+~\\B", "tilde")
.replaceAll("\\B`[^`]+`\\B", "tick")
.replaceAll("\\B\\\\\\(.*?\\\\\\)\\B", "backslashparen");
return processed;
}
}
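To see the difference between \b and \B next to a non-word character, here is a minimal sketch; the class and method names are only illustrative and do not come from the answer above. It tests three patterns against a two-line input.

```java
import java.util.regex.Pattern;

// Illustrative sketch: class and method names are assumptions for this example.
class BoundaryDemo {
    // Returns true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "_Text_\n*text*\n";
        // '_' is a word character, so \b matches between the line break (or the start) and '_'.
        System.out.println(found("\\b_[^_]+_\\b", input));      // true
        // '*' is not a word character, so there is no word boundary between '\n' and '*'.
        System.out.println(found("\\b\\*[^*]+\\*\\b", input));  // false
        // \B matches exactly where \b does not, so it works around '*'.
        System.out.println(found("\\B\\*[^*]+\\*\\B", input));  // true
    }
}
```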

Related

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic to help future readers with the same issue. I probably haven't found anything that solves my issue because I can't phrase it properly. I'll change the title, or just close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the whitespace in between is some \t characters first, then some spaces).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete, runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or is there a better way to write the regexes themselves (I'm open to improvements, I'm not a big regex expert)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the remaining whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
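As a runnable sketch of the trim-and-split idea, the following applies it to inputs shaped like the question's samples. The class name, the parse helper and the length check are assumptions added for illustration; they are not part of the answer.

```java
// Illustrative sketch: VersionSplitDemo and parse are hypothetical names.
class VersionSplitDemo {
    // Returns {version, buildId}; rejects input that does not hold exactly two tokens.
    static String[] parse(String data) {
        String[] parts = data.trim().split("\\s+"); // \s+ also covers \t, \n and \r
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two tokens but got " + parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] samples = {
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 "
        };
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println("Version: " + parts[0] + ", build id: " + parts[1]);
        }
    }
}
```

The length check is a design choice: without it, malformed input would surface later as an ArrayIndexOutOfBoundsException on arr[1].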
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a letter, digit or underscore, then captures the rest of the version. Note that .+ is greedy and matches any character except a line break, so this group can also pick up whitespace that follows the version on the same line; trim() the captured value, or use \S+ instead of .+, if that matters. \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads in the build id, ensuring that it conforms to your specified format. Note that this still captures the dashes, making it easy to split the string afterwards to get the parts you need. If you want, you could even give each part its own capture group.
Then it ends with \s* so that trailing whitespace doesn't get in the way. It uses * instead of + because the match shouldn't fail when there's no trailing whitespace.
I think this would be robust enough for production (aside from the fact that the strings cannot begin with any whitespace, which is fixable, but I wasn't sure if that's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the groups are numbered by their opening parenthesis from left to right as group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match a single whitespace character followed by one or more characters other than a line break. In samples 3 and 4 the first whitespace character after the version is followed by the remaining trailing spaces before the \n, so the first match consists of nothing but spaces, which becomes "" once you strip the whitespace.
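The empty first match can be reproduced with a short sketch. It assumes that sample 3 carries several trailing spaces after each token (the question says there are some, but not how many), and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: EmptyMatchDemo and findAll are hypothetical names.
class EmptyMatchDemo {
    // Collects every match of the regex, with all whitespace stripped as in the question.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group().replaceAll("\\s", ""));
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumption: three trailing spaces after the version and after the build id.
        String sample = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345   ";
        // One whitespace, then the rest of the line: the first match is only spaces.
        System.out.println(findAll("[\\s].+", sample));   // [, 12345-12345-cici-12345]
        // A whole run of whitespace (including \n), then the rest: a single match.
        System.out.println(findAll("[\\s]+.+", sample));  // [12345-12345-cici-12345]
    }
}
```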
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Nitty Gritties
Reason for the empty lines in the output
It's because, by default, the . does NOT match line terminators: it stops matching just before the \n. To change that, compile the pattern with the Pattern.DOTALL flag using Pattern.compile(String regex, int flags).
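A small sketch of the effect of that flag, using a made-up two-line string and hypothetical class and method names:

```java
import java.util.regex.Pattern;

// Illustrative sketch: DotAllDemo and matchesWhole are hypothetical names.
class DotAllDemo {
    // Returns true if the regex, compiled with the given flags, matches the whole input.
    static boolean matchesWhole(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "version \nbuild";
        // Without flags, '.' stops at the line break, so .+ cannot span both lines.
        System.out.println(matchesWhole(".+", 0, twoLines));              // false
        // With DOTALL, '.' also matches the line break.
        System.out.println(matchesWhole(".+", Pattern.DOTALL, twoLines)); // true
    }
}
```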
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Using a regex to match a word ending in a comma but not within another word

I want to use a regex to achieve two objectives: match a string only when it is a complete word (don't match "on" inside of "contact"), and match strings that end with a comma or period.
This is an example. It is meant to find the string (str2) in str and replace it with the same string surrounded by parentheses.
while(scan2.hasNext()) {
    String str2 = scan2.next();
    str = str.replaceAll("\\b" + str2 + "\\b", "(" + str2 + ")");
}
It does avoid matching strings within words, but it ignores strings that end in a comma or period.
How would I do this?
public class Main {
public static void main(String[] args) {
System.out.println(replace("upon contact", "on"));
System.out.println(replace("upon contact,", "contact"));
System.out.println(replace("upon contact", "contact"));
}
private static String replace(String s1, String s2) {
return s1.replaceAll(String.format("\\b(%s)\\b(?=[.,])", s2), "\\($1\\)");
}
}
upon contact // matches only complete words
upon (contact), // replaces match with (match)
upon contact // only matches if ends with , or .
The following regex matches string ending with comma/period or string composed by a single complete word:
(?s)(^(?<A>\b\w+\b)$)|((?s)^(?<B>.+(?<=[,.]))$)
See also https://regex101.com/r/E78rQV/1/ for more explanations.
I took the liberty of adding the exclamation point and question mark.
Brackets mean that it will match any one of the characters inside them.
str = str.replaceAll("\\b" + str2 + "[\\b.,!?]", "(" + str2 + ")");
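As an alternative sketch, not taken from any of the answers, the word can be wrapped while the punctuation stays outside the match, so a following comma or period survives the replacement. Pattern.quote and Matcher.quoteReplacement guard against special characters in the search word; the class and method names are assumptions, and the sketch expects the word to begin and end with a word character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: WholeWordParens and wrapWord are hypothetical names.
class WholeWordParens {
    // Wraps every whole-word occurrence of 'word' in parentheses.
    static String wrapWord(String text, String word) {
        // \b on both sides keeps "on" from matching inside "upon" or "contact";
        // a following ',' or '.' is a non-word character, so \b still matches before it.
        String regex = "\\b" + Pattern.quote(word) + "\\b";
        return text.replaceAll(regex, Matcher.quoteReplacement("(" + word + ")"));
    }

    public static void main(String[] args) {
        System.out.println(wrapWord("upon contact, on time.", "on"));  // upon contact, (on) time.
        System.out.println(wrapWord("upon contact, now", "contact"));  // upon (contact), now
    }
}
```

If the search word itself arrives with trailing punctuation (for example "contact," read from a Scanner), it would need to be stripped before building the pattern.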

How can I replace everything between 2nd "." and “:” in java?

Been researching online but haven't been able to find a solution.
I've got the following string '555.8.0.i5:790.2.0.i19:904.1.0:8233.2:' in Java.
What's the best way to remove everything from the second dot (inclusive) up to the colon?
I want the string to end up looking like this: 555.8:790.2:904.1:8233.2:
I saw on another post someone had referenced the second dot with java regex (\d+.\d.) but I'm not sure how to do the trim.
EDIT:
I have tried the following java regex .replaceAll("\\.(.*?):", ":"); but it seems to remove everything from the first dot. Not sure how to get it to trim from the second dot.
In your case, you may use
.replaceAll("(\\.[^:.]+)\\.[^:]+", "$1")
Details:
(\\.[^:.]+) - Capture group 1 capturing a dot and 1+ chars other than a literal dot and colon
\\. - a literal dot
[^:]+ - 1+ chars other than a colon.
In the replacement pattern, only a $1 backreference to the value captured in Group 1 is used.
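Applied to the string from the question, the replacement can be sketched as follows (the class and method names are only illustrative):

```java
// Illustrative sketch: SecondDotDemo and trimAfterSecondDot are hypothetical names.
class SecondDotDemo {
    // Keeps the first ".xyz" of each colon-separated part and drops everything
    // from the second dot up to (but not including) the colon.
    static String trimAfterSecondDot(String s) {
        return s.replaceAll("(\\.[^:.]+)\\.[^:]+", "$1");
    }

    public static void main(String[] args) {
        String input = "555.8.0.i5:790.2.0.i19:904.1.0:8233.2:";
        System.out.println(trimAfterSecondDot(input)); // 555.8:790.2:904.1:8233.2:
    }
}
```

The last part, 8233.2, has no second dot, so the pattern does not match there and the part is left unchanged.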
Do you have to use regex? Here is a solution using Java:
public static void main(String[] args) {
String myString = "555.8.0.i5:790.2.0.i19:904.1.0:8233.2:";
StringBuilder sb = new StringBuilder();
//Split the string into an array of strings at each colon
String[] stringParts = myString.split(":");
//Loop over each substring
for (String stringPart : stringParts) {
//Find the index of the second dot
int secondDotIndex = stringPart.indexOf('.', 1 + stringPart.indexOf('.', 1));
//If a second dot exists then remove everything after and including the dot
if (secondDotIndex != -1) {
stringPart = stringPart.substring(0, secondDotIndex);
}
//Append each string part and colon back to the final string
sb.append(stringPart);
sb.append(":");
}
System.out.println(sb.toString());
}
The final println prints 555.8:790.2:904.1:8233.2:

Java regular expression lookahead

I have strings that I need to use regex to replace a specific character. The strings are in the following format:
"abc.edf" : "abc.abc", "ghi.ghk" : "bbb.bbb" , "qwq.tyt" : "ddd.ddd"
I need to replace the periods, '.', that are between the strings in quotes before the colon but not the strings in quotes after the colon and before the comma. Could someone shed some light?
This pattern will match the entire part that you want to touch: "\w{3}\.\w{3}" : "\w{3}\.\w{3}". Since it includes the colon and the values on both sides, it won't match pairs where there is a comma between the values. Depending on your needs, you may need to change \w to some other character class.
But, as I'm sure you are aware, you don't want to replace the entire string. You only want to replace the one character. There are two ways to do that. You can either use look-aheads and look-behinds to exclude everything else except the period from the resulting match:
Pattern: (?<="\w{3})\.(?=\w{3}" : "\w{3}\.\w{3}")
Replacement: :
Or, if the look-aheads and look-behinds confuse you, you could just capture the whole thing and include the original values from the captured groups in the replacement value:
Pattern: ("\w{3})\.(\w{3}" : "\w{3}\.\w{3}")
Replacement: $1:$2
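Both variants can be tried in Java with a sketch like the one below. The patterns are the two given above, escaped for a Java string literal; the class and method names are assumptions made for this example.

```java
// Illustrative sketch: KeyDotDemo and its method names are hypothetical.
class KeyDotDemo {
    // Variant 1: lookbehind and lookahead, so only the dot itself is matched.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=\"\\w{3})\\.(?=\\w{3}\" : \"\\w{3}\\.\\w{3}\")", ":");
    }

    // Variant 2: capture both sides of the dot and put them back in the replacement.
    static String withGroups(String s) {
        return s.replaceAll("(\"\\w{3})\\.(\\w{3}\" : \"\\w{3}\\.\\w{3}\")", "$1:$2");
    }

    public static void main(String[] args) {
        String text = "\"abc.edf\" : \"abc.abc\", \"ghi.ghk\" : \"bbb.bbb\" , \"qwq.tyt\" : \"ddd.ddd\"";
        // Both print: "abc:edf" : "abc.abc", "ghi:ghk" : "bbb.bbb" , "qwq:tyt" : "ddd.ddd"
        System.out.println(withLookarounds(text));
        System.out.println(withGroups(text));
    }
}
```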
Try with the following pattern: /\.(?=[a-z]+)/g
Java Working Demo:
public class StackOverFlow31520446 {
public static String text;
public static String pattern;
public static String replacement;
static {
text = "\"abc.edf\" : \"123.231\", \"ghi.ghk\" : \"456.678\" , \"qwq.tyt\" : \"141.242\"";
pattern = "\\.(?=[a-z]+)";
replacement = ";";
}
public static String replaceMatches(String text, String pattern, String replacement) {
return text.replaceAll(pattern, replacement);
}
public static void main(String[] args) {
System.out.println(replaceMatches(text, pattern, replacement));
}
}
Not sure what you intend to do with the string, but this is a way to match the contents of the quotes. The contents are in capture buffer 1. You could use a callback to replace the dots within the contents, passing the result back to the main replacement function.
Find: "([^"]*\.[^"]*)"(?=\s*:)
Replace: " + func( call to replace dots from capt buff 1 ) + "
Formatted:
" # Open quote
( [^"]* \. [^"]* ) # (1), group 1 - contents
" # Close quote
(?= # Lookahead, must be a colon
\s*
:
)
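In Java, one way to get the effect of such a callback is the appendReplacement/appendTail loop of Matcher. The following is a sketch under that assumption; the class and method names and the choice of ':' as the replacement character are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: QuotedKeyDemo and replaceDotsInKeys are hypothetical names.
class QuotedKeyDemo {
    // Replaces dots with ':' inside quoted strings that are followed by a colon.
    static String replaceDotsInKeys(String text) {
        Pattern p = Pattern.compile("\"([^\"]*\\.[^\"]*)\"(?=\\s*:)");
        Matcher m = p.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // The "callback": rewrite the captured contents, then re-add the quotes.
            String fixed = "\"" + m.group(1).replace('.', ':') + "\"";
            m.appendReplacement(sb, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\"abc.edf\" : \"abc.abc\", \"ghi.ghk\" : \"bbb.bbb\" , \"qwq.tyt\" : \"ddd.ddd\"";
        // Prints: "abc:edf" : "abc.abc", "ghi:ghk" : "bbb.bbb" , "qwq:tyt" : "ddd.ddd"
        System.out.println(replaceDotsInKeys(text));
    }
}
```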
I would go for a different approach (maybe it is even faster). In your loop over all strings, first check whether the string matches a number, \d*\.?\d*, and if it doesn't, replace . with : (without any regexp).
Would that solve your problem?
You can do it without look arounds:
str = str.replaceAll("(\\D)\\.(\\D)", "$1:$2");
should be sufficient for the task.

Regex to match only commas not in parentheses?

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?
Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next parenthesis (if any) is not a closing parenthesis. Only then is the comma allowed to match.
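As a usage sketch, the same lookahead (written without the comments) can be passed to String.split to break the example string at the top-level commas. The class and method names are assumptions, and like the regex itself the sketch assumes there are no nested parentheses.

```java
import java.util.Arrays;

// Illustrative sketch: CommaSplitDemo and splitOutsideParens are hypothetical names.
class CommaSplitDemo {
    // Splits on commas that are not followed by a ')' before the next '('.
    static String[] splitOutsideParens(String s) {
        return s.split(",(?![^(]*\\))");
    }

    public static void main(String[] args) {
        String input = "12,44,foo,bar,(23,45,200),6";
        // Prints: [12, 44, foo, bar, (23,45,200), 6]
        System.out.println(Arrays.toString(splitOutsideParens(input)));
    }
}
```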
Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches a complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
In this demo, you can see the Group 1 captures in the lower right pane.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
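For a rough idea of what replacing looks like with this technique, here is a sketch that is not taken from the referenced article. It turns every comma outside parentheses into a semicolon and copies parenthesized sections through unchanged; the class name, method name and the ';' replacement are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: OutsideParensReplace and replaceCommas are hypothetical names.
class OutsideParensReplace {
    // Replaces commas outside parentheses with ';' (no nested parentheses assumed).
    static String replaceCommas(String subject) {
        Matcher m = Pattern.compile("\\(.*?\\)|(,)").matcher(subject);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Group 1 is set only for a comma that was not swallowed by the left branch.
            String replacement = (m.group(1) != null) ? ";" : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceCommas("12,44,foo,bar,(23,45,200),6")); // 12;44;foo;bar;(23,45,200);6
    }
}
```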
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...
I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
// Everything before the '(' plus everything after the ')': the parenthesized part is cut out.
String beforeParen = longString.substring(0, longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
/* do something. */
firstComma = beforeParen.indexOf(',', firstComma + 1);
}
(Of course this assumes that there is always exactly one opening parenthesis and one matching closing parenthesis somewhere after it.)
