I have a Java toString() on code generated from XML. As a company we write the toString() output to our logs, and I am having trouble writing a regex that masks all the sensitive data effectively.
Here is a sample toString() output:
String input="com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
Expected output:
com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=****************, clientId=12345]
Can someone help me with a regex that masks everything up to the last comma (,) before the next equals sign (=)?
Here is what I tried:
maskPatterns.add("clientName=(.*?)=");
This ends up masking everything up to the next =. I can't figure out how to make it back up to the last comma (,) before the next equals sign (=).
Also, if anyone has a better regex for this, I am all ears.
You can use
clientName=(.*?)(?=\s*,\s*\w+=|\])
See the regex demo
Details
clientName= - a literal string
(.*?) - Group 1: any zero or more chars other than line break chars as few as possible
(?=\s*,\s*\w+=|\]) - a positive lookahead that requires, immediately to the right of the current location, either a comma enclosed with zero or more whitespaces on both ends (\s*,\s*) followed by one or more word chars and a =, or (|) a ] char (\]).
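As a rough illustration of how the lookahead-based pattern above could be used from Java, here is a minimal sketch. The class name ClientNameMasker, the mask() helper and the fixed 16-asterisk MASK constant are assumptions made for this example and are not part of the original code.

```java
import java.util.regex.Pattern;

class ClientNameMasker {
    // Pattern from the answer: lazily consume the value until the lookahead
    // sees ", key=" or the closing "]".
    private static final Pattern CLIENT_NAME =
            Pattern.compile("clientName=(.*?)(?=\\s*,\\s*\\w+=|\\])");

    // Hypothetical fixed-width mask (an assumption for this sketch).
    private static final String MASK = "****************";

    static String mask(String text) {
        // The lookahead consumes nothing, so the following ", clientId=..." is kept.
        return CLIENT_NAME.matcher(text).replaceAll("clientName=" + MASK);
    }

    public static void main(String[] args) {
        String input = "com.example.sensitive.info.UserInfo#15b1534"
                + "[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
        System.out.println(mask(input));
    }
}
```

Because the end of the value is found by the lookahead, the same call should also work when clientName is the last field before the closing ].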
Or, if you need the same amount of asterisks, use
String result = text.replaceAll("(\\G(?!^)|clientName=).(?=.*?,\\s*\\w+=|\\])", "$1*");
See this regex demo.
Details
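(\G(?!^)|clientName=) - Group 1: either the end of the previous successful match (\G(?!^)) or the literal string clientName=
. - any char but a line break char
(?=.*?,\s*\w+=|\]) - a positive lookahead that requires, immediately to the right of the current location, either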
.*?,\s*\w+= - any zero or more chars other than line break chars as few as possible, a comma, zero or more whitespaces, one or more word chars and a =
| - or
\] - a ] char.
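To see the \G-based replacement at work, the following sketch wraps it in a small helper and prints the result for the sample string. The class name SameLengthMasker, the SAMPLE constant and the mask() method are hypothetical names introduced only for this illustration.

```java
class SameLengthMasker {
    // Sample toString() output taken from the question.
    static final String SAMPLE = "com.example.sensitive.info.UserInfo#15b1534"
            + "[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";

    static String mask(String text) {
        // Each match is "clientName=" (or the end of the previous match) plus ONE char;
        // "$1*" puts the prefix back and swaps that single char for an asterisk.
        return text.replaceAll(
                "(\\G(?!^)|clientName=).(?=.*?,\\s*\\w+=|\\])", "$1*");
    }

    public static void main(String[] args) {
        System.out.println(mask(SAMPLE));
    }
}
```

Since every character is replaced one for one, the masked string has the same length as the input. Note that the lookahead needs a later ", key=" pair, so with this exact pattern a clientName that is the last field before ] appears to be left unmasked.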
Use String#replaceAll here:
String input = "com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
String output = input.replaceAll("\\bclientName=.*?(,?\\s*)(?=\\w+=|\\])", "clientName=****************$1");
System.out.println(input);
System.out.println(output);
This prints:
com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]
com.example.sensitive.info.UserInfo#15b1534[name=User1, clientName=****************, clientId=12345]
Note that the number of asterisks probably should not match the number of characters in the original clientName. Matching it would partially reveal the original content, since it discloses at least the length of the clientName string.
According to your example of maskPatterns.add("clientName=(.*?)="); I assume that you want the value in capture group 1.
If it should be agnostic of the square brackets for marking the end of the value, but you don't want to match them either, you might use:
\bclientName=([^\r\n,=\[\]]+(?:,(?!\h*\w+=)[^\r\n,=\[\]]*)*)
Explanation
\bclientName= A word boundary, then match clientName=
( Capture group 1
[^\r\n,=\[\]]+ Match 1+ times any char except , = [ ] or a newline
(?: Non capture group
,(?!\h*\w+=) Match a comma asserting what is directly to the right is not 0+ horizontal whitespace chars, 1+ word chars and an = sign
[^\r\n,=\[\]]* Optionally match any char except a newline , = [ ]
)* Close non capture group and repeat 0+ times to get all occurrences of a comma
) Close group 1
Regex demo
If the [ and ] can also be part of the clientName, you can omit them from the character classes.
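The capture-group pattern can be tried out with a short Java sketch like the one below. ClientNameExtractor, extract() and mask() are illustrative names assumed for this example; only the regex itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClientNameExtractor {
    // Group 1 holds the value; a comma is only consumed when it is NOT
    // followed by optional horizontal whitespace, word chars and "=".
    private static final Pattern CLIENT_NAME = Pattern.compile(
            "\\bclientName=([^\\r\\n,=\\[\\]]+(?:,(?!\\h*\\w+=)[^\\r\\n,=\\[\\]]*)*)");

    // Returns the captured value, or null when no clientName field is found.
    static String extract(String text) {
        Matcher m = CLIENT_NAME.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Replaces only the captured value and leaves the rest of the text untouched.
    static String mask(String text, String mask) {
        Matcher m = CLIENT_NAME.matcher(text);
        if (!m.find()) {
            return text;
        }
        return text.substring(0, m.start(1)) + mask + text.substring(m.end(1));
    }

    public static void main(String[] args) {
        String input = "com.example.sensitive.info.UserInfo#15b1534"
                + "[name=User1, clientName=HARVARD LAW SCHOOL, THE, clientId=12345]";
        System.out.println(extract(input));
        System.out.println(mask(input, "****"));
    }
}
```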
I have a URL and am trying to extract the integer at the end of the link with a regex, stripping everything that is not part of that number.
The URL is something like
https://some.storage.com/test123456.bucket.com/folder/80.png
The regex I tried to use:
Integer.parseInt(string.replaceAll(".*[^\\d](\\d+)", "$1"))
The output of that replacement is "80.png", and I need only "80". I also tried this tool: https://regex101.com. As far as I can see, the main problem is that ".png" is not matched by my regex, so after the substitution that part is left in the result.
I'm a total noob at regex, so I kindly ask you for help.
You may use
String result = string.replaceAll("(?:.*\\D)?(\\d+).*", "$1");
See the regex demo.
NOTE: If there is no match, the result will be equal to the string value. If you do not want this behavior, instead of "(?:.*\\D)?(\\d+).*", use "(?:.*\\D)?(\\d+).*|.+".
Details
(?:.*\D)? - an optional (it must be optional because the Group 1 pattern might also be matched at the start of the string) sequence of
.* - any 0+ chars other than line break chars, as many as possible
\D - a non-digit
(\d+) - Group 1: any one or more digits
.* - any 0+ chars other than line break chars, as many as possible
The replacement is $1, a reference to the Group 1 value, that is, the last chunk of one or more digits in a string that has no line breaks.
Line breaks can be supported if you prepend the pattern with the (?s) inline DOTALL modifier, i.e. "(?s)(?:.*\\D)?(\\d+).*|.+".
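A small sketch of the final variant, including the (?s) modifier and the |.+ fallback, might look like this. LastNumber and lastDigits() are hypothetical names chosen for the example.

```java
class LastNumber {
    // Keeps only the last run of digits; returns "" when the input has no digits.
    static String lastDigits(String s) {
        // (?s) lets "." cross line breaks; the "|.+" branch consumes inputs
        // without digits, for which group 1 is unset and "$1" expands to nothing.
        return s.replaceAll("(?s)(?:.*\\D)?(\\d+).*|.+", "$1");
    }

    public static void main(String[] args) {
        String url = "https://some.storage.com/test123456.bucket.com/folder/80.png";
        System.out.println(Integer.parseInt(lastDigits(url)));
    }
}
```

Callers should still check for an empty result before passing it to Integer.parseInt, since parsing an empty string throws NumberFormatException.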
Given the following strings (stringToTest):
G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
And the Pattern:
Pattern p = Pattern.compile("G2:\\S+RQ:3,4");
if (p.matcher(stringToTest).find())
{
// Match
}
For string 1 I DON'T want to match, because RQ:3,4 is associated with the G3 section, not G2, and I want string 2 to match, as RQ:3,4 is associated with G2 section.
The problem with the current regex is that it's searching too far and reaching the RQ:3,4 eventually in case 1 even though I don't want to consider past the G2 section.
It's also possible that the stringToTest might be (just one section):
G2:7JAPjGdnGy8jxR8[RQ:3,4]
The strings 7JAPjGdnGy8jxR8 and jRo6pN8ZW9aglYz are variable length hashes.
Can anyone help me with the correct regex to use, to start looking at G2 for RQ:3,4 but stopping if it reaches the end of the string or -G (the start of the next section).
You may use this regex with a negative lookahead in between:
G2:(?:(?!G\d+:)\S)*RQ:3,4
RegEx Demo
RegEx Details:
G2:: Match literal text G2:
(?: Start a non-capture group
(?!G\d+:): Assert that we don't have a G<digit>: ahead of us
\S: Match a non-whitespace character
)*: End non-capture group. Match 0 or more of this
RQ:3,4: Match literal text RQ:3,4
In Java use this regex:
String re = "G2:(?:(?!G\\d+:)\\S)*RQ:3,4";
The problem is that \S matches any non-whitespace char, including - and G, and the regex engine parses the text from left to right. Once it finds G2: it grabs all non-whitespace chars to the right (since \S+ is a greedy subpattern) and then backtracks to find the rightmost occurrence of RQ:3,4.
In a general case, you may use
String regex = "G2:(?:(?!-G)\\S)*RQ:3,4";
See the regex demo. (?:(?!-G)\S)* is a tempered greedy token that will match 0+ occurrences of a non-whitespace char that does not start a -G substring.
If the hyphen is only possible in front of the next section, you may subtract - from \S:
String regex = "G2:[^\\s-]*RQ:3,4"; // using a negated character class
String regex = "G2:[\\S&&[^-]]*RQ:3,4"; // using character class subtraction
See this regex demo. [^\\s-]* will match 0 or more chars other than whitespace and -.
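The tempered greedy token and the negated character class can be compared side by side on the three sample strings from the question. The sketch below is only an illustration; SectionMatcher and its two methods are names assumed for the example.

```java
import java.util.regex.Pattern;

class SectionMatcher {
    // Tempered greedy token: consume non-whitespace chars, but never step onto "-G".
    private static final Pattern TEMPERED =
            Pattern.compile("G2:(?:(?!-G)\\S)*RQ:3,4");

    // Negated character class: valid only if "-" cannot occur inside a section.
    private static final Pattern NEGATED =
            Pattern.compile("G2:[^\\s-]*RQ:3,4");

    static boolean matchesTempered(String s) {
        return TEMPERED.matcher(s).find();
    }

    static boolean matchesNegated(String s) {
        return NEGATED.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]",
            "G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]",
            "G2:7JAPjGdnGy8jxR8[RQ:3,4]"
        };
        for (String s : samples) {
            System.out.println(matchesTempered(s) + " " + matchesNegated(s) + " " + s);
        }
    }
}
```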
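Try using [^\[] instead of \S in this regex: G2:[^\[]*\[RQ:3,4
[^\[] means any character but [ (in Java the [ inside the character class has to be escaped, because an unescaped [ opens a nested class)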
Demo
(considering that strings like this: G2:7JAP[jGd]nGy8[]R8[RQ:3,4] are not possible)
How do I write a regex that matches only a single word that starts with big?
I tried to build a regex from a start string and an end string, with big as the start and \s (whitespace) as the end.
Consider this line: You are my big-big-big friend and also a brother
When I use the regex below, the result is big-big-bigfriendandalsoabrother
(.big.*\s)
But I expect the result big-big-big. The word can be at the start or at the end of the line. I want a regex that matches the full word that starts with big.
Any help would be appreciated.
The following regex may be used:
(?<!\S)big\S*
Details:
(?<!\S) - a negative lookbehind that makes sure there is start of string or a whitespace immediately to the left of the current location
big - a literal substring
\S* - any 0 or more chars other than whitespace chars
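For illustration, the lookbehind pattern could be applied in Java roughly as follows. BigWordFinder and findAll() are hypothetical names used only in this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BigWordFinder {
    // "big" must sit at the start of the string or right after whitespace.
    private static final Pattern BIG_WORD = Pattern.compile("(?<!\\S)big\\S*");

    // Collects every whitespace-delimited word that starts with "big".
    static List<String> findAll(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = BIG_WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findAll("You are my big-big-big friend and also a brother"));
    }
}
```

The lookbehind is what keeps big inside a longer word, such as ambiguous, from being reported.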
You can also try the regex
(?!\s)big\S*
Explanation:
(?!\s)
A negative lookahead: it asserts that the next character is not a whitespace. Since the next character has to be the b of big, the assertion always holds and constrains nothing; it does not inspect what comes before big
big
Will find the literal text big
\S*
Will find any character that's NOT a whitespace, 0 or more times
So:
(?!\s)big\S*
Finds big followed by anything that's not a whitespace, until it hits a whitespace. It returns big-big-big for the sample line, but it also matches big in the middle of a longer word; to require whitespace or the start of the string before big, use the lookbehind (?<!\S) instead.
I have a string with data separated by commas like this:
$d4kjvdf,78953626,10.0,103007,0,132103.8945F,
I tried the following regex but it doesn't match the strings I want:
[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,[a-zA-Z0-9]+\\,
The $ at the beginning of your data string is not matching the regex. Change the first character class to [$a-zA-Z0-9]. And a couple of the comma separated values contain a literal dot. [$.a-zA-Z0-9] would cover both cases. Also, it's probably a good idea to anchor the regex at the start and end by adding ^ and $ to the beginning and end of the regex respectively. How about this for the full regex:
^[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,[$.a-zA-Z0-9]+\\,$
Update:
You said number of commas is your primary matching criteria. If there should be 6 commas, this would work:
^([^,]+,){6}$
That means: match at least 1 character that is anything but a comma, followed by a comma. And perform the aforementioned match 6 times consecutively. Note: your data must end with a trailing comma as is consistent with your sample data.
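A quick way to check the six-comma rule from Java is sketched below. CsvShapeCheck and isValid() are names invented for this example.

```java
import java.util.regex.Pattern;

class CsvShapeCheck {
    // Six repetitions of "one or more non-comma chars, then a comma".
    private static final Pattern SIX_FIELDS = Pattern.compile("^([^,]+,){6}$");

    static boolean isValid(String line) {
        // matches() already requires the whole input to match, so ^ and $
        // are redundant here but harmless.
        return SIX_FIELDS.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("$d4kjvdf,78953626,10.0,103007,0,132103.8945F,"));
        System.out.println(isValid("a,b,c,"));
    }
}
```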
Well, your regular expression is certainly garbled: there are clearly characters (like $ and .) that your expression won't match, and you don't need to \\ escape commas. Let's first describe the requirements. You seem to be saying a valid string is defined as:
A string consisting of 6 commas, with one or more characters before each one
We can represent that with the following pattern:
(?:[^,]+,){6}
This says match one or more non-commas, followed by a comma - [^,]+, - six times - {6}. The (?:...) notation is a non-capturing group, which lets us say match the whole sub-expression six times; without it, the {6} would only apply to the preceding character.
Alternately, we could use normal, capturing groups to let us select each individual section of the matching string:
([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?
Now we can not only match the string, but extract its contents at the same time, e.g.:
String str = "$d4kjvdf,78953626,10.0,103007,0,132103.8945F,";
Pattern regex = Pattern.compile(
    "([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),?");
Matcher m = regex.matcher(str);
if (m.matches()) {
    for (int i = 1; i <= m.groupCount(); i++) {
        System.out.println(m.group(i));
    }
}
This prints:
$d4kjvdf
78953626
10.0
103007
0
132103.8945F
I'm trying to compare following strings with regex:
#[xyz="1","2"'"4"] ------- valid
#[xyz] ------------- valid
#[xyz="a5","4r"'"8dsa"] -- valid
#[xyz="asd"] -- invalid
#[xyz"asd"] --- invalid
#[xyz="8s"'"4"] - invalid
The valid pattern should be:
#[xyz, then an = sign, then some chars, then a comma, then some chars, then ', then some chars, and finally ]. This means that if there are characters after xyz, they must be in the format ="XXX","XXX"'"XXX".
Or only #[xyz], with no characters after xyz.
I have tried the following regex, but it did not work:
String regex = "#[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"]";
Here the quoted part after xyz is optional, the number of characters between the quotes is not fixed, and there could also be some characters before and after this pattern, like asdadad #[xyz] adadad.
You can use the regex:
#\[xyz(?:="[a-zA-Z0-9]+","[a-zA-Z0-9]+"'"[a-zA-Z0-9]+")?\]
See it
Expressed as a Java string it'll be:
String regex = "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
What was wrong with your regex?
[...] defines a character class. When you want to match a literal [ or ] you need to escape it by preceding it with a \.
[a-zA-z][0-9] matches a single letter followed by a single digit (and the range A-z also covers the punctuation chars between Z and a, so it should be A-Z). But you want one or more alphanumeric characters, so you need [a-zA-Z0-9]+
Use this:
String regex = "#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]";
When you write [a-zA-z][0-9], it expects a letter followed by a digit. You also have to escape the first and last square brackets, because square brackets have a special meaning in regexes.
Explanation:
[a-zA-Z0-9]+ means an alphanumeric character (but not an underscore), one or more times.
(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? means that the expression in parentheses can occur once or not at all.
Since square brackets have a special meaning in regex (they define character classes, as you used them yourself), you need to escape them if you want to match them literally.
String regex = "#\\[xyz=\"[a-zA-z][0-9]\",\"[a-zA-z][0-9]\"'\"[a-zA-z][0-9]\"\\]";
The next problem is with [a-zA-z][0-9]: it defines "first a letter, second a digit". You need to join those classes (using A-Z rather than A-z) and add a quantifier:
String regex = "#\\[xyz=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\"\\]";
See it here on Regexr
there could also be some characters before and after this pattern like
asdadad #[xyz] adadad.
The regex should be:
String regex = "(.)*#\\[xyz(=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\](.)*";
The first and last (.)* allow any string before and after the pattern, as you mentioned in your edit. As ademiban said, (=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")? will occur once or not at all. The other mistakes are well explained in the other answers; +1 to all of them.
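To tie the answers together, here is a minimal sketch that checks the six sample strings from the question. XyzTagValidator and containsValidTag() are hypothetical names. It uses find() rather than the (.)* wrappers so that surrounding text is allowed, which is a design choice of this example and not something the answers prescribe.

```java
import java.util.regex.Pattern;

class XyzTagValidator {
    // Escaped brackets, an optional ="..","..'".." part, and A-Z (not A-z) in the classes.
    private static final Pattern TAG = Pattern.compile(
            "#\\[xyz(?:=\"[a-zA-Z0-9]+\",\"[a-zA-Z0-9]+\"'\"[a-zA-Z0-9]+\")?\\]");

    // True when the text contains a well-formed tag anywhere inside it.
    static boolean containsValidTag(String text) {
        return TAG.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "#[xyz=\"1\",\"2\"'\"4\"]",
            "#[xyz]",
            "#[xyz=\"a5\",\"4r\"'\"8dsa\"]",
            "#[xyz=\"asd\"]",
            "#[xyz\"asd\"]",
            "#[xyz=\"8s\"'\"4\"]",
            "asdadad #[xyz] adadad"
        };
        for (String s : samples) {
            System.out.println(containsValidTag(s) + "  " + s);
        }
    }
}
```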